Geocoding User Queries
Total Page:16
File Type:pdf, Size:1020Kb
Geocoding User Queries Dipl.-Inf. Konstantin Clemens ORCID: 0000-0003-1892-9574 an der Fakultät VII – Wirtschaft und Management der Technischen Universität Berlin zur Erlangung des akademischen Grades Doktor der Ingenieurwissenschaften - Dr.-Ing. - genehmigte Dissertation Promotionsausschuss: Tag der wissenschaftlichen Aussprache: Vorsitzender: 10. Dezember 2019 Prof. Dr. Manfred Hauswirth Gutachter: Prof. Dr. Axel Küpper Prof. Dr.-Ing. Jochen Schiller Prof. Dr.-Ing. David Bermbach Berlin 2020 Technische Universität Berlin Fakultät IV Service-centric Networking Doctoral Thesis Geocoding User Queries Author: Supervisors: Konstantin Clemens Prof. Dr. Axel Küpper Prof. Dr.-Ing. Jochen Schiller Prof. Dr.-Ing. David Bermbach iii Declaration of Authorship I, Konstantin Clemens, declare that this thesis titled, “Geocoding User Queries” and the work presented in it are my own. I conrm that: • This work was done wholly or mainly while in candidature for a research degree at this University. • Where any part of this thesis has previously been submitted for a degree or any other qualication at this University or any other institution, this has been clearly stated. • Where I have consulted the published work of others, this is always clearly attributed. • Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work. • I have acknowledged all main sources of help. • Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself. Signed: Date: v Abstract While human users refer to locations using toponyms and addresses, computers rely on latitude and longitude coordinates, or similar, numerical encoding of location information. Geocoding is the process of resolving such toponyms and addresses into machine-friendly numerical encoding. Thus, wherever human users need to specify a location to a computer, geocoding is involved. The process, thereby, is mainly dependent on two aspects. Firstly, the underlying address data needs to be complete, up-to-date, and accurate as otherwise, the geocoding result will not be precise. Secondly, the algorithms that process the user input query, browse candidates within the data and select the right result to the query need to be capable of handling queries of human users. In this thesis, the specics of human input addresses are tackled. Various kinds of modications occur to an address when humans are involved. For example, address formats are not adhered to, superuous tokens are specied, while required parts of an address are left out of the query. Also, humans abbreviate, escape, or misspell address element names often. The goal of this thesis is to investigate the algorithmic aspect of geocoding systems. Throughout the thesis, ways to process, organize, index, and rank address data are investigated independently of the underlying data. As a result, a method for creating a geocoding system with good precision and recall metrics even in the face of the human factor is developed. A sequence of experiments is executed that gradually build on top of each other. First, the use of a generic document search engine as the index of a geocoding system is validated. Next, a suitable address model for indexing and searching is selected. Then, using log data from a live geocoding system, a statistical model is created that can be used to generate user-like requests. A similar approach based on the log data is evaluated to create geocoding systems that perform measurably better than geocoding systems relying on common methods to handle user queries. Finally, a process is developed that continuously improves a geocoding system using user queries issued against it. The measurements prove that the path chosen in this thesis is a viable method to create geocoding systems that can handle user queries better than geocoding systems relying on common techniques. vii Zusammenfassung Menschen benutzen Adressen um Standorte zu beschreiben. Computer hingegen benutzen Längen- und Breitengrade, oder ähnliche numerische Kodierungen dafür. Den Prozess der Umwandlung einer Adresse in Längen- und Breitengrade nennt man Geocoding. Jedes Mal, wenn ein Mensch einem Computer einen Standort nennt, wird dieser Prozess ausgeführt. Dabei sind zwei Aspekte besonders wichtig für das Geocoding. Erstens, die von dem Prozess benutzen Daten müssen aktuell, vollständig und genau sein da sonst das Ergebnis ungenau sein kann. Zweitens müssen die von dem Prozess benutzen Algorithmen mit menschlichen Eingaben umgehen zu können. Diese sind dafür zuständig die Eingabe zu verarbeiten, die Daten zu durchsuchen und das richtige Ergebnis auszuwählen. Verschiedene Veränderungen der eigentlichen Adresse kommen vor, wenn Menschen diese von Hand eingeben. Zum Beispiel halten sich Menschen oft nicht an geltende Formate und spezizieren überüssige Entitäten während sie notwendige Elemente von Adressen weglassen. Außerdem benutzen Menschen manchmal Abkürzungen, lassen Umlaute oder Akzente weg, oder machen schlicht Fehler, wenn sie eine Adresse eingeben. Das Ziel dieser Arbeit ist es den algorithmischen Teil von Geocoding Systemen zu untersuchen. In dieser Dissertation werden Methoden zum Vorbereiten, Organisieren, Indizieren, und Sortieren von Adressdaten unabhängig von den Daten selbst erforscht. Das Ergebnis ist ein Verfahren zum Erstellen von Geocoding Systemen, die sogar angesichts des menschlichen Faktors eine gute Genauigkeit und Treerquote aufweisen. Eine Serie von Experimenten, die auf einander aufbauen, wird dazu ausgeführt. Zunächst wird überprüft, dass eine generische Suchmaschine als Kern eines Geocoding Systems eingesetzt werden kann. Danach wird ein passendes Modell für die Indizierung und Suche von Adressen ausgewählt. Als nächstes wird ein Protokoll eines Geocoding Systems im oenen Betrieb dazu benutzt ein statistisches Modell zu erstellen. Mit dessen Hilfe können Geocoding Anfragen generiert werden, die ähnlich sind zu Anfragen von realen Menschen. Ein ähnlicher Ansatz wird benutzt um ein Geocoding System aufzusetzen, das menschliche Anfragen messbar besser verarbeiten kann als Geocoding Systeme mit gewöhnlichen Methoden. Schließlich wird ein Prozess erarbeitet, der Geocoding Systeme kontinuierlich verbessert. Dazu werden Anfragen benutzt die das System selbst verarbeiten muss. Messergebnisse belegen, dass die in dieser Dissertation vorgeschlagene Vorgehensweise eine plausible Methode ist, um Geocoding Systeme aufzusetzen, die Anfragen von Menschen besser verarbeiten können als Geocoding Systeme mit herkömmlichen Ansätzen. ix Related Publications Many of ideas, methods, and results collected in this thesis have been published in scientic papers and journals and presented at international conferences. To some extent, therefore, parts of this thesis are contained in the following publications: [1] Clemens, K. Automated processing of postal addresses. In GEOProcessing 2013: The Fifth International Conference on Advanced Geographic Information Systems, Applications, and Services (2013). IARIA. [2] Clemens, K. Geocoding with OpenStreetMap Data. In GEOProcessing 2015: The Seventh International Conference on Advanced Geographic Information Systems, Applications, and Services (2015). IARIA. [3] Clemens, K. Qualitative Comparison of Geocoding Systems using OpenStreetMap Data. In ThinkMind: International Journal On Advances in Software, volume 8, numbers 3 and 4, 2015. IARIA. [4] Clemens, K. Comparative Evaluation of Alternative Addressing Schemes. In GEOProcessing 2016: The Eighth International Conference on Advanced Geographic Information Systems, Applications, and Services (2016). IARIA. [5] Clemens, K. Enhanced Address Search with Spelling Variants. In GISTAM 2018: 4th International Conference on Geographical Information Systems Theory, Application and Management (2018). SCITEPRESS. [6] Clemens, K. Indexing Spelling Variants for Accurate Address Search. In: Geographical Information Systems Theory, Applications and Management. 4th International Conference, GISTAM 2018, Funchal, Madeira, Portugal, March 17-19, 2018, Revised Selected Papers (2019). Springer. [7] Clemens, K. Deriving Spelling Variants from User Queries to Improve Geocoding Accuracy. In GISTAM 2019: 5th International Conference on Geographical Information Systems Theory, Application and Management (2019). SCITEPRESS. xi Acknowledgments First of all, I would like to thank my supervisor Prof. Dr. Axel Küpper for providing me with trust and space at times and pushing me to deliver when slowed down. As an external doctorate student, I was able to only spend parts of my time with the research group Service-centric Networking. Even though, I received a lot of support from my fellow research colleagues too, who spent a lot of time discussing with me and reviewing my publications. Dear SNET colleagues, thank you very much for your input. I also understand the time and eort needed to read a body of work as this thesis, provide feedback and grade it. Therefore, my thank also goes to the committee of censors, including Prof. Dr. Axel Küpper, Prof. Dr.-Ing. Jochen Schiller, as well as Prof. Dr.-Ing. David Bermbach. Another big source of support was my former employer HERE Technologies and the many supervisors that I had the opportunity to work with while working at this thesis. They supported me along with my terric colleagues who always had my back while I was doing the research for this thesis. A special thank you goes to