Improving the Openstreetmap Data Set Using Deep Learning
Total Page:16
File Type:pdf, Size:1020Kb
Improving the OpenStreetMap Data Set using Deep Learning Hampus Londögård Hannah Lindblad [email protected] [email protected] June 19, 2018 Master’s thesis work carried out at ÅF Digital Solutions. Supervisors: Pierre Nugues, [email protected] Thomas Hermansson, [email protected] Examiner: Jacek Malec, [email protected] Abstract OpenStreetMap is an open source of geographical data where contributors can change, add, or remove data. Since anyone can contribute, the data set is prone to contain data of varying quality. In this work, we focus on three approaches for correcting Way component name tags in the data set: Correct- ing misspellings, flagging anomalies, and generating suggestions for missing names. Today, spell correction systems have achieved a high correction accuracy. However, the use of a language context is an important factor to the success of these systems. We present a way for performing spell correction without context through the use of a deep neural network. The structure of the network also makes it possible to adapt it to a different language by changing the training resources. The implementation achieves an F1 score of 0.86 (ACR 0.69) for Way names in Denmark. Keywords: MSc, Machine Learning, OpenStreetMap, Random Forest, Neural Net- work, Sequence-2-Sequence 2 Acknowledgements We would like to thank Pierre Nugues for the feedback he has given us on this thesis. We would also like to thank Marcus Klang for his invaluable input on neural networks and Daniel Palmqvist for his feedback and knowledge of algorithms and geographical in- formation systems. Last but not least, we would like to thank ÅF and Thomas Hermansson for giving us the opportunity to work on this thesis. 3 4 Contents 1 Introduction 9 1.1 Background . .9 1.2 Purpose . 10 1.3 Problem Formulation . 10 1.3.1 Challenges . 11 1.3.2 Constraints . 11 1.4 Related Work . 11 1.5 Outline . 12 1.6 Work Distribution . 12 2 Theory 13 2.1 Introduction to Machine Learning . 13 2.1.1 Central Concepts . 14 2.1.2 Bootstrapping Data . 15 2.2 Decision Trees . 15 2.3 Ensemble Learning . 15 2.3.1 Random Forest . 15 2.3.2 XGBoost . 16 2.4 Neural Networks . 17 2.4.1 Loss . 18 2.4.2 Gradient Descent . 18 2.4.3 Hyperparameters . 19 2.4.4 Optimization Algorithms and Updaters . 20 2.5 Recurrent Neural Networks . 21 2.5.1 Back Propagation . 22 2.5.2 LSTM . 23 2.5.3 Bidirectional RNNs and BiLSTM . 23 2.6 Autoencoder . 24 2.7 Compound Words in Danish . 25 2.8 OpenStreetMap and Tags . 26 5 CONTENTS 2.9 Damerau-Levenshtein Distance . 27 2.10 Evaluation Metrics . 27 2.10.1 F-measure . 27 2.10.2 Confusion Matrix . 28 2.10.3 Accurate Correction Rate . 29 3 Approach 31 3.1 CRISP-DM . 31 3.2 Method . 33 3.2.1 Pre-work . 33 3.2.2 Implementation and Evaluation . 34 3.3 Tools . 37 3.3.1 Atlas . 37 3.3.2 DeepLearning4Java . 37 3.3.3 Algorithmic Spell Correction . 38 4 Implementation 41 4.1 S/SS Spell Checking . 41 4.1.1 Experimental Setup . 41 4.1.2 Baseline Implementation . 42 4.1.3 Random Forest Implementation . 42 4.1.4 Feed Forward Neural Network Implementation . 43 4.1.5 Results . 43 4.2 Way Name Spell Checking . 43 4.2.1 Experimental Setup . 43 4.2.2 RNN and BiRNN Implementations . 44 4.2.3 Results . 44 4.3 Speed Limit Anomalies . 46 4.4 Name and Tag Anomalies . 46 4.4.1 Experimental Setup . 46 4.4.2 RNN Implementation . 47 4.4.3 XGBoost Implementation . 48 4.4.4 Results . 48 4.5 Missing Names . 48 4.5.1 Results . 49 5 Discussion 51 5.1 Interpretation of Results . 51 5.2 Spell Checking Results . 51 5.2.1 Difference Between Test and Reality . 52 5.2.2 Prediction Errors . 52 5.2.3 Misspelled Names in Estonia . 53 5.3 Improving Uncertain Predictions . 53 5.3.1 Propagation . 54 5.3.2 SymSpell . 54 5.4 Saturation of the Network . 54 5.5 Reduction of Aggression . 55 6 CONTENTS 5.6 Weight of Suffix Extraction . 55 5.7 Name and Tag Anomalies . 56 6 Conclusions 57 7 Future Work 59 7.1 History log . 59 7.2 Improving the Data Sets . 59 7.3 Meta Information . 60 7.4 Alphabet . 60 7.5 Data Set Balancing . 60 Appendix A Missing Name Figures 67 Appendix B Generating Danish 71 7 CONTENTS 8 Chapter 1 Introduction In this chapter, we introduce the motivations for this thesis. The first two sections present the background and purpose. We then continue to cover the research questions, related work, outline and work distribution. 1.1 Background OpenStreetMap (OSM) is an initiative to create a free, editable map of the world. The project maintains an open data set of geographical data which can be used for any purpose as long as OpenStreetMap is credited. Anyone can become a contributor and add data to the project. The OSM data set consists of three basic components: Nodes, Ways, and Relations. An example of a component is the Way component which is a list of between 2 and 2000 nodes. Way components are used to represent linear features like rivers and roads (Open- StreetMap, 2017). The components can also have tags attached to them. A tag consists of key and value fields in free text format. The tags are used to describe the elements they are attached to. An example of a tag is the name tag which is used to convey the name of a component. The city of Copenhagen is for instance a node with the key-value name tag name=København. Conventions for the meaning and use of tags have been agreed upon by the OSM com- munity. These are however not enforced by any particular entity and thus the data contained in the OSM data set is of varying quality. In the name tag case, there are occurrences of misspelled, missing, and incorrect names. For Way components, the names should match the street name depicted on the street sign for the road in question (OpenStreetMap, 2018). 9 1. Introduction 1.2 Purpose In this thesis, we focus on name tags for Way components in the OSM data set of Den- mark. Our goal is to predict if the instances are correct or incorrect given the name tag, potentially in combination with other tags like maximum speed, highway type, and sur- face. If possible, suggestions should also be generated in combination with the prediction of an incorrect instance. The suggestions could then be used for manual correction of the instance. With the resulting algorithm, we aim to assist the ongoing work of improving the qual- ity of the data contained in the OSM data set. We intend to integrate the end result into the open source tool Atlas Checks (Apple Inc, 2018), which tests Atlas file data integrity by traversing the OSM data set and applying different kinds of checks. An example of a check that is performed today is a test that finds duplicate Nodes. By integrating the work into Atlas Checks, the data set can be monitored for erroneous data without the need for manually searching the data. 1.3 Problem Formulation We aim to improve the quality of the OSM data set by increasing the number of accurate Way components and hence smooth the data set. This should be achieved by flagging pos- sibly incorrect instances for manual correction. By requiring manual correction instead of performing changes automatically, we lower the expectations on the system. The proposed solution should answer the following questions: 1. How to identify misspelled names in Way components? 2. How to generate suggestions for a misspelled name? 3. How to generate suggestions for Way components with missing names? 4. Is it possible to find anomalies in names in combination with other tags like maximal speed? Short motivations for the different research questions are listed below. • A quality issue in the data is the occurrence of misspelled names. In order to correct these instances, they first need to be identified. • When a misspelled instance has been identified, a suggestion for a correction should also be proposed by the system to simplify the correction process. • Another issue is missing names for Way instances. A Way is considered to have a missing name if neither the name tag nor the noname tag is present. These instances should be flagged and suggestions provided algorithmically. • The last step in our attempt to smooth the data set is by identifying anomalies in the data. One relation could be the Danish word for street (“gade”) and the maximum speed limit of 50 km/h or less. If a Way name’s suffix is “gade”, it’s 25 times more probable to have a max speed limit of max 50 km/h in comparison to a Way with the suffix “vej”. By taking advantage of relations like these, anomalies could possibly be identified and flagged. 10 1.4 Related Work 1.3.1 Challenges Given related work and research, it is not reasonable to expect ideal performance from a spell corrector without annotated data nor context. Whitelaw et al. (2009) showed the possibility of creating a spell corrector without annotated data by using the frequency of words on the web. However, they used the context of a sentence to improve their perfor- mance. In our case of Way name tags, it is often not possible to find the name on the Web, except for in other map services such as Google Maps.