DISSERTATION Submitted to the Combined Faculty for the Natural Sciences and Mathematics of Heidelberg University, Germany for the Degree of Doctor of Natural Sciences
Total Page:16
File Type:pdf, Size:1020Kb
DISSERTATION submitted to the Combined Faculty for the Natural Sciences and Mathematics of Heidelberg University, Germany for the degree of Doctor of Natural Sciences Put forward by M.A. Hui Li Born in: Nanjing, Jiangsu, China Oral examination:.................................................. Social Network Extraction and Exploration of Historic Correspondences Advisor: Prof. Dr. Michael Gertz Abstract Historic correspondences, in the form of letters, provide a scenario in which historic figures and events are reflected and thus play a ubiquitous role in the study of history. Confronted with the digitization of thousands of historic letters and motivated by the potentially valuable insights into history and intuitive quan- titative relations between historic persons, researchers have recently focused on the network analysis of historic correspondences. However, most related research constructs the correspondence networks only based on the sender-recipient relation with the objective of visualization. Very few of them have proceeded beyond the above stage to exploit the detailed modeling of correspondence networks, let alone to develop novel concepts and algorithms derived from network analysis or formal approaches to the data uncertainty issue in historic correspondence. In the context of this dissertation, we develop a comprehensive correspondence network model, which integrates the personal, temporal, geographical, and topic information extracted from letter metadata and letter content into a hypergraph structure. Based on our correspondence network model, we analyze three types of person-person relations (sender-recipient, co-sender, and co-recipient) and two types of person-topic relations (author-topic and sender-recipient-topic) statically and dynamically. We develop multiple measurements, such as local and global reciprocity for quantifying reciprocal behavior in weighted networks, and the topic participation score for quantifying interests or the focus of individuals or real-life communities. We investigate the rising and the fading trends of topics in order to find correlations among persons, topics, and historic events. Furthermore, we develop a novel probabilistic framework for refinement of uncertain person names, geographical location names, and temporal expressions in the metadata of historic letters. We conduct extensive experiments using letter collections to validate and evaluate the proposed models and measurements in this dissertation. A thorough discussion of experimental results shows the effectiveness, applicability and advantages of our developed models and approaches. iii Zusammenfassung Historische Korrespondenzen in Briefform stellen ein Szenario dar, in dem his- torische Figuren und Ereignisse dokumentiert werden und spielen eine zentrale Rolle in den Geschichtsstudien. Aufgrund der Digitalisierung von tausenden historischen Briefen und motiviert durch potenzielle Einblicke in die Geschichte und die quantitativen Beziehungen zwischen historischen Personen, haben sich Forscher in letzter Zeit auf die Netzwerkanalyse historischer Korrespondenzen konzentriert. Die meisten dieser Forschungen konstruieren Korrespondenz- netzwerke jedoch nur auf Grundlage von Sender-Empfänger-Beziehungen mit dem Ziel der Visualisierung. Nur sehr wenige Arbeiten darüber hinaus und nutzen die detaillierte Modellierung von Korrespondenznetzwerken aus, um neue Konzepte und Algorithmen aus der Netzwerkanalyse zu verwenden oder Ansätzen zum Datenunsicherheitsproblem in historischer Korrespondenzen zu entwickeln. Im Kontext dieser Dissertation wird ein umfassendes Korrespondenznetzwerk- modell vorgestellt, das persönliche, zeitliche, geografische und thematische Informationen aus Briefmetadaten und Briefinhalten in eine Hypergraphstruktur integriert. Basierend auf dieser Grundlage werden drei Typen von Personen- Personen-Beziehungen (Sender-Empfänger, Mitsender, und Mitempfänger) und zwei Typen von Person-Themen-Beziehungen (Autor-Thema und Sender- Empfänger-Thema) analysiert. Verschiedene Messverfahren werden entwickelt, wie lokale und globale Reziprozität zur Quantifizierung der reziproken Verhaltens in gewichteten Netzwerken, sowie ein Topic Participation Score zur Quan- tifizierung von Interessen oder Foki von Einzelpersonen bzw. realen Communities. Zunehmende und abnehmende Trends in Topics dienen dazu, die Korrelationen zwischen Personen, Themen und historischen Ereignissen zu identifizieren. Außer- dem wird ein neuartiges probabilistisches Rahmenwerk zur Verfeinerung von unsicheren Personennamen, geografischen Ortsnamen und zeitlichen Ausdrücken in Brief-Metadaten entwickelt. Anhand von umfangreichen Experimenten mit großen historischen Briefsamm- lungen werden die vorgeschlagenen Modelle und Messungen in dieser Disserta- tion validiert und evaluiert. Eine detaillierte Diskussion der Ergebnisse der Ex- perimente dient dazu, die Effektivität, die Anwendbarkeit, und die Vorteile der entwickelten Modelle und Ansätze aufzuzeigen. iv Acknowledgements This dissertation could not be considered completed if I do not acknowledge the assis- tance, patience, and support of many individuals and institutions. First and foremost, I would like to express my deepest gratitude to my supervisor, Prof. Dr. Michael Gertz, of the Faculty of Mathematics and Computer Science at Heidelberg University, for giving me the honor to come to his group and inspiring me with the prime research overview that leads to the possibility of the research of this dissertation. Not only that, I would like to thank him for imparting me his valuable advice, profound knowledge and tremendous insights, as well as his firm support and enduring patience to my research progress and academic writing over the years. Before I came to the Database Research System Group in 2013, I was a casual student who had trivial knowledge about my late research. But that did not deter my supervisor from being understanding, responsible and compassionate in supervising me with my research. During the past four years, I have met many problems and difficulties, however, the support and guidance of my supervisor helped me stay committed to research and forging through the difficult times to the final part of the dissertation. I sincerely appreciate that he took time out of his busy schedule to go through the draft of my dissertation and met me after only a few days with comments and suggestions on almost every page. My supervisor sets a great example of excellence as a researcher to me and I sin- cerely thank him for his kindness and tolerance to my study from the bottom of my heart. I would like to express my gratitude to my second supervisor, Prof. Dr. Stefan Riezler, for his warm-hearted support in my research, the insightful comments I have received, as well as the fruitful weekly colloquiums and delicious cakes he has provided. Besides my supervisors, I am also grateful to the rest of the committee members PD Dr. Wolfgang Merkle and Prof. Dr. Gerhard Reinelt for the genuine comments I will receive in the defense. I would like to express my gratitude to the Heidelberg Graduate School of Mathemat- ical and Computational Methods for the Sciences (HGS MathComp) for the financial support so that this dissertation has become possible. I am also grateful to Dr. Michael Winckler at HGS MathComp, for giving me consistent encouragement and useful advice upon my doctoral research. I would also like to express my gratitude to all the following institutions that provide data sources for my research that made my dissertation possi- ble: Melanchthon Forschungsstelle at Heidelberg University, Department of Theology at Heidelberg University, Darwin Correspondence Project at Cambridge University, Wallace Correspondence Project, and Mark Twain Project. v During my PhD study, I had a wonderful time to work at our Database Research System Group, in particular thanks to a circle of friendly and warm-hearted people. Thank you Dr. Jannik Strötgen for asking me to collaborate with him on Heideltime and helping me publish my first paper in this group. Thank you Dr. Christian Sengstock for helping me solve a lot of questions related to data mining and social network analysis. Thank you Thomas Bögel, you are a really smart person and an ideal office mate. Thank you Andreas, for your valuable encouragement and thoughtful discussions from time to time. Many thanks to HiWis such as Leonard Henger and Niels Bernlöhr, who helped me a lot with software installation and server configuration. Thank you, Lutz Büch, you are a really humorous and optimistic person, who always brings happiness to everyone. Thank you Kai Chen, for imparting you excellent programming skills and experiences to me. My thanks also go to Dr. Tran Van Canh, Dr. Van Quoc Anh, Dr. Ayser Armiti, Dr. Hamed Abdelhaq, Dr. Florian Flatow, Julian Zell, Ruobing Shen, Xiaoyu Chen, Zhen Dong, Diego Costa, Catherine Proux-Wieland, and everyone else in our lunch group for your friendship and support over the years. I am very grateful to Mrs. Natalia Ulrich, Mrs. Dorothea Heukäufer, and Mrs. Anke Sopka, for assisting me in many different ways and handling the paperwork sincerely. I would additionally like to express my gratitude to Elke Pürzer, Johannes Huber, Anna and Rudi for giving me working opportunity and teaching me a lot in other areas besides research. I wish furthermore to thank Jonathan Griffiths as freelance English-language proofreader for having corrected and vastly improved both the language and the style of this dissertation. He was also especially patient and prompt