A General Architecture to Enhance Wiki Systems with Natural Language Processing Techniques
Total Page:16
File Type:pdf, Size:1020Kb
A GENERAL ARCHITECTURE TO ENHANCE WIKI SYSTEMS WITH NATURAL LANGUAGE PROCESSING TECHNIQUES BAHAR SATELI A THESIS IN THE DEPARTMENT OF COMPUTER SCIENCE AND SOFTWARE ENGINEERING PRESENTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE IN SOFTWARE ENGINEERING CONCORDIA UNIVERSITY MONTREAL´ ,QUEBEC´ ,CANADA APRIL 2012 c BAHAR SATELI, 2012 CONCORDIA UNIVERSITY School of Graduate Studies This is to certify that the thesis prepared By: Bahar Sateli Entitled: A General Architecture to Enhance Wiki Systems with Natural Language Processing Techniques and submitted in partial fulfillment of the requirements for the degree of Master of Applied Science in Software Engineering complies with the regulations of this University and meets the accepted standards with respect to originality and quality. Signed by the final examining commitee: Chair Dr. Volker Haarslev Examiner Dr. Leila Kosseim Examiner Dr. Gregory Butler Supervisor Dr. Rene´ Witte Approved Chair of Department or Graduate Program Director 20 Dr. Robin A. L. Drew, Dean Faculty of Engineering and Computer Science Abstract A General Architecture to Enhance Wiki Systems with Natural Language Processing Techniques Bahar Sateli Wikis are web-based software applications that allow users to collabora- tively create and edit web page content, through a Web browser using a simplified syntax. The ease-of-use and “open” philosophy of wikis has brought them to the attention of organizations and online communities, leading to a wide-spread adoption as a simple and “quick” way of collab- orative knowledge management. However, these characteristics of wiki systems can act as a double-edged sword: When wiki content is not prop- erly structured, it can turn into a “tangle of links”, making navigation, organization and content retrieval difficult for their end-users. Since wiki content is mostly written in unstructured natural language, we believe that existing state-of-the-art techniques from the Natural Lan- guage Processing (NLP) and Semantic Computing domains can help miti- gating these common problems when using wikis and improve their users’ experience by introducing new features. The challenge, however, is to find a solution for integrating novel semantic analysis algorithms into the multi- tude of existing wiki systems, without the need for modifying their engines. In this research work, we present a general architecture that allows wiki systems to benefit from NLP services made available through the Semantic Assistants framework – a service-oriented architecture for brokering NLP pipelines as web services. Our main contributions in this thesis include an analysis of wiki engines, the development of collaboration patterns be- tween wikis and NLP, and the design of a cohesive integration architecture. As a concrete application, we deployed our integration to MediaWiki – the powerful wiki engine behind Wikipedia – to prove its practicability. Fi- nally, we evaluate the usability and efficiency of our integration through a iii number of user studies we performed in real-world projects from various domains, including cultural heritage data management, software require- ments engineering, and biomedical literature curation. iv Acknowledgments As Isaac Newton once said, “If I have seen further, it is by standing on the shoulders of giants.” My utmost gratitude goes to my thesis supervisor, Dr. Rene´ Witte for his invaluable guidance, endless patience and under- standing throughout the course of this thesis. His positive outlook, care- ful editing and confidence in my research inspired me to do the best of my ability. I would like to express my appreciation to Dr. Marie-Jean Meurs for offering me her kind support and generous contributions to the evaluation of this thesis. Thank you for always being there for me. I am indebted to you for your kind heart and sisterly love. I would also like to thank my fellow lab mates at the Semantic Software Lab and the scientists of Concordia Centre for Structural and Functional Genomics for their careful feedback on my research work. I thank Caitlin Murphy and Elian Angius for their time and hard work, as well as tens of Software Engineering students who participated in the evaluation of this thesis. Last but not least, I cordially thank my parents and my brother for their everlasting love and encouragements. Words fall short to express how grateful I am to feel the warmth of their presence and support at every step of my life. It is to whom this thesis is dedicated. v Table of Contents List of Figures x List of Tables xiii List of Acronyms xiv 1 Introduction 1 1.1 Motivation ............................... 1 1.2 Approach ............................... 3 1.3 Outline ................................ 6 2 Background 7 2.1 Wiki Systems ............................. 7 2.1.1 System Specifications .................... 7 2.1.2 Markup Languages ..................... 9 2.2 Wikis in Practice ........................... 11 2.2.1 Wikipedia, An Encyclopedia Wiki ............. 11 2.2.2 Personal Information and Knowledge Management . 12 2.2.3 Software Requirements Engineering ............ 12 2.2.4 Enterprise Wikis ....................... 13 2.2.5 Cultural Heritage Data Management . ........ 14 2.3 Common Problems in Wikis .................... 15 2.4 Suggested NLP Solutions ...................... 17 2.5 The Semantic Assistants Architecture .............. 19 2.5.1 System Overview ....................... 19 2.5.2 System Workflow ....................... 21 vi 2.5.3 Service Results ........................ 22 2.6 Resum´ e´ ................................ 23 3 Related Work 24 3.1 NLP-Enhanced Wikis ........................ 24 3.1.1 Wikulu, An Intelligent User Interface for Wikis ..... 24 3.2 Semantic Wikis ............................ 27 3.2.1 Semantic MediaWiki, A Semantic Extension to Medi- aWiki.............................. 27 3.2.2 IkeWiki, A Semantic Wiki for Collaborative Knowledge Management ......................... 30 3.2.3 SweetWiki, A Semantic Web Enabled Technology Wiki . 32 3.2.4 AceWiki, A Natural and Expressive Semantic Wiki . 34 3.3 Discussion .............................. 36 3.4 Resum´ e´ ................................ 37 4 Requirements Analysis 38 4.1 End-User Requirements ....................... 39 4.2 Wiki Developer Requirements ................... 41 4.3 System Requirements ........................ 42 4.4 Discussion .............................. 44 4.5 Resum´ e´ ................................ 46 5 System Design 47 5.1 Design Alternatives ......................... 47 5.1.1 A Browser Plug-in ...................... 48 5.1.2 A Wiki Plug-in ........................ 49 5.1.3 A Semantic Assistants Wiki Component . ........ 51 5.1.4 A Proxy Server Component ................. 52 5.1.5 Summary ........................... 53 5.2 The Analysis Workflow ....................... 54 5.2.1 User Interaction ....................... 55 5.2.2 Service Invocation ...................... 61 5.2.3 Wiki Communication .................... 66 5.3 Transformation of Results ..................... 73 vii 5.3.1 Semantic Metadata Representation ............ 74 5.4 Wiki Independency .......................... 75 5.4.1 Module-based Architecture ................. 75 5.4.2 Semantics-based Architecture ............... 75 5.5 Wiki Ontology ............................. 76 5.6 Developed Solution ......................... 79 5.7 Resum´ e´ ................................ 81 6 Implementation and Application 83 6.1 System Overview ........................... 83 6.2 The Semantic Assistants Servlet .................. 86 6.2.1 The User Interface Module ................. 88 6.2.2 The Wiki Helper Module ................... 91 6.2.3 The Semantic Assistants Broker .............. 92 6.2.4 The Wiki Ontology Repository ............... 93 6.3 The Semantic Assistants Wiki Plug-in .............. 94 6.4 Storing and Presenting Service Results .............. 96 6.4.1 Templating Mechanism ................... 97 6.4.2 Storing the Markup .....................100 6.5 Service Execution Flow .......................102 6.6 Resum´ e´ ................................104 7 Evaluation 106 7.1 Methodology .............................106 7.2 Wiki-based Cultural Heritage Data Management . .......108 7.2.1 Evaluation Scenario .....................109 7.2.2 Results .............................110 7.3 Wiki-based Collaborative Software Requirements Engineering 111 7.3.1 Evaluation Scenario .....................113 7.3.2 Results .............................114 7.4 Wiki-based Biomedical Literature Curation ...........118 7.4.1 Evaluation Scenario .....................119 7.4.2 Results .............................122 7.5 Resum´ e´ ................................125 viii 8 Conclusions and Future Work 126 8.1 Summary ...............................126 8.2 Suggestions for Future Work ....................129 8.3 Conclusion ..............................130 Bibliography 131 A Wiki Upper Ontology Description 139 B MediaWiki Ontology Description 142 C ReqWiki Questionnaire 148 D Questionnaire Responses 158 D.1 Graduate Students Responses ...................158 D.2 Undergraduate Students Responses ...............173 ix List of Figures 1 Wikipedia–afreeencyclopedia wiki ............... 2 2 A subset of MediaWiki text formatting markup . ........ 10 3 A MediaWiki page markup importing the FOAF vocabulary . 10 4 The Semantic Assistants architecture [WG09].......... 20 5 The Semantic Assistants service execution workflow [WG09].21 6 The DTD for Semantic Assistants response messages ..... 23 7 The Wikulu system