Computer Support for the Analysis and Improvement of the Readability of IT-Related Texts
Department of Informatics
TECHNISCHE UNIVERSITÄT MÜNCHEN

Master’s Thesis in Information Systems

Computer Support for the Analysis and Improvement of the Readability of IT-related Texts
Computergestützte Analyse und Verbesserung der Lesbarkeit von IT-bezogenen Texten

Author: Matthias Holdorf
Supervisor: Prof. Dr. Florian Matthes
Advisor: Bernhard Waltl, M.Sc.
Submission Date: 15.11.2016

I confirm that this master’s thesis is my own work and I have documented all sources and materials used.

München, 15.11.2016 Matthias Holdorf

Acknowledgments

First and foremost, I would like to thank my advisor Bernhard Waltl and my industry advisor Andreas Zitzelsberger for their preeminent support, interest, and time. I feel fortunate to have had the opportunity to learn from both the academic field and the industry.

Furthermore, I would like to thank Prof. Dr. Florian Matthes for his time and feedback, and for providing me the opportunity to write this thesis at the Software Engineering for Business Information Systems (SEBIS) chair, which he holds. I also want to thank my conversation partners Tobias Waltl, Mark Becker, and Henning Femmer.

I would like to thank the numerous participants of the quantitative survey, and especially my interview partners. During the search for interview partners, we had an astonishing 100% confirmation rate. Even the managing directors took the time to answer our questions. Such support and appreciation for our work felt overwhelming.
I am grateful for the opportunity to write my thesis at QAware, which provided us with the environment and technical infrastructure to make this thesis a great project on which I will gladly look back in the future.

Abstract

Context: A major task in information technology (IT) is communication. Difficult-to-read text hinders communication between stakeholders and can have expensive consequences.

Objectives: We aim to design a tool that decreases the amount of time and resources needed to improve the readability of an IT-related text.

Method: We transfer the concept of bug patterns in static code analysis to the readability of text as readability anomalies. The term readability anomaly refers to an indicator of difficult-to-read text passages that may negatively affect communication. To identify the business needs of a software company with a staff of 100 employees, we conducted qualitative interviews and a quantitative survey. Furthermore, we reviewed existing approaches and methodologies from the knowledge base. Subsequently, we designed and implemented a readability checker based on the elicited requirements.

Results: The results of the interviews confirmed the assumptions of previous work: Difficult-to-read text hinders communication. The anomaly detection yielded an average precision of 69% with high variation. We investigated the relevance of the true-positive findings with a controlled experiment. Our participants considered 64% of the findings as relevant and would incorporate 59% immediately. Moreover, they were not aware of 48% of the findings. During the application of the tool, the practitioners incorporated 49% of the overall findings. An analysis of our readability checker takes an average of 40 seconds for 10,000 words.

Conclusion: Our readability analysis tool (RAT) can uncover many practically relevant anomalies.
Although some readability anomalies need to be adjusted or have to be supported by richer linguistic features, the checker provides effective means to improve the readability of IT-related texts. Based on our application in a practical environment, we found the following requirements and prospects for future work: Improvement of the precision and relevance of anomalies, domain-specific anomalies, configurability of anomaly detection, paraphrasing of detected anomalies, performance of an analysis, integration in the workflow of a company, support of various file formats, and the extent of integration in text processing programs.

Keywords: Natural Language Processing, Readability Assessment, Style Checker, Readability Checker, UIMA, DKPro Core, Office Open XML

Content

1. Introduction
  1.1 Problem Statement
  1.2 Research Approach
    1.2.1 Behavioral Science
    1.2.2 Design Science
    1.2.3 Research Process
    1.2.4 Summary
  1.3 Contributions
    1.3.1 Positioning of Research
    1.3.2 Research Questions
  1.4 Outline
2. Knowledge Base
  2.1 Terminology
  2.2 Taxonomy of Related Work
    2.2.1 Readability Formulas
    2.2.2 Spell Checker
    2.2.3 Grammar Checker
    2.2.4 Style and Readability Checker
    2.2.5 Controlled Language Checker
    2.2.6 Text Simplification
    2.2.7 Paraphrasing
  2.3 Academic Approaches
    2.3.1 MULTILINT
    2.3.2 TextLint
    2.3.3 Smella
    2.3.4 DeLite
    2.3.5 Coh-Metrix
    2.3.6 EasyEnglish
  2.4 Industry Approaches
    2.4.1 LanguageTool
    2.4.2 LinguLab
    2.4.3 Grammarly
  2.5 Overview of Related Work
  2.6 Discussion
3. Environment
  3.1 Interview Design
  3.2 Interview