Automated Development of Semantic Data Models Using Scientific Publications Martha O
Total Page:16
File Type:pdf, Size:1020Kb
University of New Mexico UNM Digital Repository Computer Science ETDs Engineering ETDs Spring 5-12-2018 Automated Development of Semantic Data Models Using Scientific Publications Martha O. Perez-Arriaga University of New Mexico Follow this and additional works at: https://digitalrepository.unm.edu/cs_etds Part of the Computer Engineering Commons Recommended Citation Perez-Arriaga, Martha O.. "Automated Development of Semantic Data Models Using Scientific ubP lications." (2018). https://digitalrepository.unm.edu/cs_etds/89 This Dissertation is brought to you for free and open access by the Engineering ETDs at UNM Digital Repository. It has been accepted for inclusion in Computer Science ETDs by an authorized administrator of UNM Digital Repository. For more information, please contact [email protected]. Martha Ofelia Perez Arriaga Candidate Computer Science Department This dissertation is approved, and it is acceptable in quality and form for publication: Approved by the Dissertation Committee: Dr. Trilce Estrada-Piedra, Chairperson Dr. Soraya Abad-Mota, Co-chairperson Dr. Abdullah Mueen Dr. Sarah Stith i AUTOMATED DEVELOPMENT OF SEMANTIC DATA MODELS USING SCIENTIFIC PUBLICATIONS by MARTHA O. PEREZ-ARRIAGA M.S., Computer Science, University of New Mexico, 2008 DISSERTATION Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy Computer Science The University of New Mexico Albuquerque, New Mexico May, 2018 ii Dedication “The highest education is that which does not merely give us information but makes our life in harmony with all existence" Rabindranath Tagore I dedicate this work to the memory of my primary role models: my mother and grandmother, who always gave me a caring environment and stimulated my curiosity. All through my childhood, I spent most of the time with my grandmother: Adela, who with proverbs and a kind smile revealed her honest, empathic, trustworthy, and wise persona to me. My mother, Clara, was always there for any people in need, and her personality full of cheer and encourage was contagious. She loved and enjoyed life deeply, while being resourceful, ethical, tolerant, and responsible under any circumstances. iii Acknowledgments First and foremost, I am profoundly thankful to my advisor and co-advisor Dr. Trilce Estrada and Dr. Soraya Abad-Mota, who appeared in my life when I truly needed academic guidance. Their ideas and human values inspire me to keep working in science. In addition, I sincerely thank the other two professors in my dissertation committee, Dr. Abdullah Mueen and Dr. Sarah Stith. In particular, the chair of this dissertation committee, Dr. Trilce Estrada, supported me with her valuable expertise and constant recommendations at each stage of the development of this investigation, from the search of this topic, to implementation, alignment of ideas with writing, and completion. The co-chair of this dissertation committee, Dr. Soraya Abad-Mota, always presented me relevant recommendations, positive encouragement, and constructive comments of every aspect of this work, from inception to conclusion. Dr. Abdullah Mueen provided me useful feedback and suggestions to improve my presentations and dissertation work, and Dr. Sarah Stith contributed with fruitful conversations about my dissertation research, writing, and job search. I give special thanks to Dr. Thomas Preston Caudell for teaching me the fundamentals on research work and the wonderful world of Neural Networks, as well as for being an exceptional advisor during my master’s degree. I express thanks to Dr. Dorian Arnold for his support and mentorship while searching for an advisor to complete my degree. I acknowledge the Visualization and Data Science Laboratories, whose members gave me friendship and useful feedback for research and presentations: Shan, Takeshi, Victor, Cameron, Jacob, Jeremy, Matt, Xin Yu, and Rudy. I thank Haleh, who collected the last dataset used in this work. I am grateful to my study partners for the comprehensive examination, particularly, Josh and Joe. I thank all my professors and classmates, especially those who helped me to learn either about Computer Science or about life. iv I recognize the National Council of Science and Technology (Consejo Nacional de Ciencia y Tecnología) for funding resources at the beginning of my studies and for strengthening the scientific investigation in Mexico. I am appreciative of the collaboration with all the people I had the opportunity to work at the Sandia National Laboratories, where I developed more problem solving and programming skills for issues in the field of Bioinformatics and Data Management. I appreciate two extraordinary women who appeared along my path during a search for tools to improve myself, Dr. Susan Cooper for beneficial observations on writing and presenting technical information; and Mable Orndorff-Plunkett for being a supportive mentor, helping with the search of an editor, and giving me excellent pointers for presentations as well. I am deeply thankful to my family and friends, especially to my daughter and husband for their patience and understanding on my continuous search of knowledge, equality, and justice. Tania Saraí was the major motivation to start this particular journey and shared her time with my studies. Aaron Andrew gave me the encouragement, strength, and emotional support to complete this important chapter in our pilgrimage. I express gratitude to Edmundo because he always encouraged me to pursue my passion for science and challenged me to search for the truth of life. I am grateful to the rest of my family that even in the distance supported me unconditionally: Arturo, Ernesto, Guadalupe, Joaquín, Meg, Nicholas, Norma, Patricia, Robert, Sayward, Sergio, Trinidad, and Verónica. Finally, I have gratitude to my dear friends Anne, Claudia, Elisa, Mariela, Mercedes’ daughter, and Mercedes’ mom, who gave me cheerful moments and brightened my path with their presence. v AUTOMATED DEVELOPMENT OF SEMANTIC DATA MODELS USING SCIENTIFIC PUBLICATIONS by MARTHA O. PEREZ-ARRIAGA M.S., Computer Science, University of New Mexico, 2008 Ph.D., Computer Science, University of New Mexico, 2018 Abstract The traditional methods for analyzing information in digital documents have evolved with the ever-increasing volume of data. Some challenges in analyzing scientific publications include the lack of a unified vocabulary and a defined context, different standards and formats in presenting information, various types of data, and diverse areas of knowledge. These challenges hinder detecting, understanding, comparing, sharing, and querying information rapidly. I design a dynamic conceptual data model with common elements in publications from any domain, such as context, metadata, and tables. To enhance the models, I use related definitions contained in ontologies and the Internet. Therefore, this dissertation generates semantically-enriched data models from digital publications based on the Semantic Web principles, which allow people and computers to work cooperatively. Finally, this work uses a vocabulary and ontologies to generate a structured characterization and organize the data models. This organization allows integration, sharing, management, and comparing and contrasting information from publications. vi Specifically, this work develops automated methods for Information Extraction, Semantic Analytics, Information Modeling, and Information Retrieval. I research how to deal with the diversity of tabular and document layouts, and detect, ex- tract, and organize information from tables embedded in digital publications using an adaptive method. To understand relevant information in documents, I interpret and enrich tables, disambiguating concepts and finding unique definitions from a general ontology and the Internet with an unsupervised learning method, while keeping the provenance information for each publication. Furthermore, I apply Natural Language Processing methods to discover and extract semantic relationships from text with non-standard vocabulary and various writing styles. Applying these methods to a variety of digital publications, I formally characterize semantic data models in a machine-readable format. These models allow us to create a network of publications to integrate, manage, and analyze information not only in a specific domain, but also among disciplines. vii Contents List of Figures xii List of Tables xiv Glossary xv 1 Introduction 1 1.1 Overview . 2 1.2 Thesis and Summary of Contributions . 7 1.2.1 Contributions . 8 1.2.2 Development of Methods for Semantic Data Models . 9 2 Literature Review 12 2.1 Table Analysis . 12 2.2 Semantic Relationships . 18 2.3 Semantic Data Integration . 20 viii Contents 3 Table Detection and Information Extraction 25 3.1 Discovery of Table Information . 26 3.2 Table Identification . 27 3.2.1 Document Conversion . 29 3.2.2 Table Detection . 32 3.2.3 Table Extraction . 34 3.3 Evaluation of Table Identification . 41 3.3.1 Baseline and Datasets of Table Identification . 41 3.3.2 Results and Discussion of Table Identification . 42 4 Table Interpretation to Synthesize Documents 46 4.1 Understanding Tables and Documents . 46 4.2 The Semantic Web and Sources of Knowledge . 48 4.3 Table Interpretation . 51 4.3.1 Discovery of Metadata and Context . 53 4.3.2 Entity Recognition, Annotation, and Disambiguation . 56 4.3.3 Discovery of Relationships . 60 4.3.4 Working Example to Synthesize a Publication . 68 4.4 Evaluation of Table Interpretation . 73 4.4.1 Dataset and Experiments of Table Interpretation