Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow

by Raul Quintana Selleras

B.S. in Information Technology, August 2012, Florida International University

B.A. in Religious Studies, December 2012, Florida International University

M.S. in Information Systems, May 2015, The University of Texas at Arlington

A Praxis submitted to

The Faculty of The School of Engineering and Applied Science of The George Washington University in partial fulfillment of the requirements for the degree of Doctor of Engineering

January 10, 2020

Praxis directed by

Timothy Blackburn, Professorial Lecturer of Engineering Management and Systems Engineering

Amir Etemadi, Associate Professor of Engineering and Applied Science

The School of Engineering and Applied Science of The George Washington University certifies that Raul Quintana Selleras has passed the Final Examination for the degree of Doctor of Engineering as of October 15, 2019. This is the final and approved form of the Praxis.

Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow

Raul Quintana Selleras

Praxis Research Committee:

Timothy Blackburn, Professorial Lecturer of Engineering Management and Systems Engineering, Praxis Co-Director

Amir Etemadi, Associate Professor of Engineering and Applied Science, Praxis Co-Director

Ebrahim Malalla, Visiting Associate Professor of Engineering and Applied Science, Committee Member


© Copyright 2019 by Raul Quintana Selleras

All rights reserved


Dedication

The author wishes to dedicate this dissertation to his daughter, Alexia Quintana, and to his wife, Kristina Quintana, for their unconditional support. Also, the author would like to thank his parents, Raul Quintana Sarduy and Gilda Selleras Rivas, whose encouragement was vital to his educational accomplishments.


Acknowledgments

The author wishes to acknowledge his praxis director, Dr. Timothy Blackburn; his editor, Peter Rosenbaum; and all faculty and staff from the Doctor of Engineering program, as well as the students from the seventh cohort.

The author thanks Andrew Rothman and Lucas Longan for their insightful suggestions.


Abstract of Praxis

Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow

With the advent of knowledge-based economies, knowledge transfer within online forums has become increasingly important to the work of IT teams. Stack Overflow, for example, is an online community in which computer programmers can interact and consult with one another to achieve information flow efficiencies and bolster their reputations, which are numerical representations of their standings within the platform. The high volume of information available on Stack Overflow, combined with significant variance in members' expertise and, hence, in the quality of their posts, hinders knowledge transfer and causes developers to waste valuable time locating good answers. Additionally, invalid answers can introduce security vulnerabilities and/or legal risks.

By conducting text analytics and regression, this research presents a predictive model to optimize knowledge transfer among software developers. The model identifies factors (e.g., good tagging, answer character count, tag frequency) that reliably lead to high-scoring answers in Stack Overflow. Upon applying natural language processing, the following variables were found to be significant: (a) the number of answers per question, (b) the cumulative tag score, (c) the cumulative comment score, and (d) the bag-of-words frequency. Additional methods were used to identify the factors that contribute to an answer being selected by the user who posted the question, the community at large, or both.


Predicting what constitutes a good, accurate answer helps not only developers but also Stack Overflow, as the site can redesign its user interface to make better use of its knowledge repository and transfer knowledge more effectively. Likewise, companies that use the platform can decrease the amount of time and resources invested in training, fix software bugs faster, and complete challenging projects in a timely fashion.


Table of Contents

Dedication ...... iv

Acknowledgments ...... v

Abstract of Praxis ...... vi

List of Figures ...... x

List of Tables ...... xii

List of Symbols / Nomenclature ...... xiii

Glossary of Terms ...... xiv

Chapter 1: Introduction ...... 1

1.1 Background ...... 1

1.2 Research Motivation ...... 5

1.3 Problem Statement ...... 6

1.4 Thesis Statement ...... 8

1.5 Research Objectives ...... 10

1.6 Research Questions and Hypotheses ...... 12

1.7 Scope of Research ...... 14

1.8 Research Limitations ...... 14

1.9 Organization of Praxis ...... 15

Chapter 2: Literature Review ...... 17

2.1 Introduction ...... 17

2.2 Information, Knowledge, and Related Concepts ...... 18

2.3 Digging into Stack Overflow ...... 21

2.4 Knowledge Transfer ...... 24


2.5 Online Forums ...... 31

2.6 Summary and Conclusions ...... 33

Chapter 3: Methodology ...... 39

3.1 Introduction ...... 39

3.2 Data Collection and Analysis ...... 43

3.3 Research Methods ...... 47

Chapter 4: Results ...... 57

4.1 Introduction ...... 57

4.2 Data Collection and Preprocessing ...... 59

4.3 Predictive Models ...... 67

4.4 Case Studies ...... 80

Chapter 5: Discussion and Conclusions ...... 86

5.1 Discussion ...... 86

5.2 Conclusions ...... 86

5.3 Contributions to Body of Knowledge ...... 88

5.4 Recommendations for Future Research ...... 88

References ...... 92

Appendix A ...... 102

Appendix B ...... 134


List of Figures

Figure 1-1. Stack Overflow question...... 4

Figure 1-2. Stack Overflow answer...... 4

Figure 1-3. Stack Overflow and optimal answer region...... 11

Figure 2-1. Interest graph...... 20

Figure 3-1. Delen’s holistic framework for knowledge management discovery...... 40

Figure 3-2. Supportability analysis...... 43

Figure 3-3. Relationship between critical data fields collected...... 45

Figure 3-4. Research product...... 49

Figure 3-5. Data preprocessing...... 51

Figure 3-6. Text analytics...... 52

Figure 3-7. ProWritingAid toolbar in Microsoft Office 365 ProPlus...... 53

Figure 3-8. ProWritingAid summary...... 53

Figure 3-9. Post metadata...... 54

Figure 3-10. Author metadata...... 55

Figure 4-1. Stack Overflow users’ age, educational level, and wake-up time...... 58

Figure 4-2. Most popular technologies in Stack Overflow...... 59

Figure 4-3. Query used for data collection in SEDE...... 60

Figure 4-4. SEDE’s partial results...... 61

Figure 4-5. Query’s partial execution plan...... 61

Figure 4-6. C-sharp functionality for calculating BOW scores...... 66

Figure 4-7. Linking data source and quantitative methods...... 67

Figure 4-8. Post frequency distribution per week...... 69


Figure 4-9. PLS standard coefficient plot for answer score...... 73

Figure 4-10. PLS loading plot for answer score...... 73

Figure 4-11. PLS response plot for answer score...... 74

Figure 4-12. Relationship model between research hypotheses and case studies...... 80


List of Tables

Table 3-1. Mapping Delen’s Holistic Framework into Stack Overflow ...... 41

Table 4-1. Sample data for a single post ...... 62

Table 4-2. Analysis of Variance ...... 74

Table 4-3. Model Selection and Validation ...... 75

Table 4-4. Coefficients ...... 75

Table 4-5. Deviance table ...... 78

Table 4-6. Model Summary ...... 78

Table 4-7. Coefficients ...... 78

Table 4-8. Goodness-of-fit Tests ...... 79

Table 4-9. Case studies in API and JavaScript ...... 84

Table A-1. SEDE’s sample data for the First Hypothesis ...... 102

Table A-2. SEDE’s sample data for the Second Hypothesis ...... 103

Table A-3. List of SEDE terms ...... 104

Table B-1. Query list ...... 134


List of Symbols / Nomenclature

PK Primary Key

FK Foreign Key


Glossary of Terms

ANOVA Analysis of Variance

API Application Program Interface

BLR Binary Logistic Regression

BME Belief Measure of Expertise

BOW Bags of Words

BPSO Binary Particle Swarm Optimization

CICO Chief Intellectual Capital Officer

CINO Chief Innovation Officer

CKO Chief Knowledge Officer

CLO Chief Learning Officer

CPO Chief People Officer

CQA Community Question Answering

CSCW Computer-Supported Cooperative Work

CST Central Standard Time

CSV Comma-Separated Values

DoS Denial of Service

EST Eastern Standard Time


FST Fixation Index

GWASs Genome-Wide Association Studies

HR Human Resources

ICTs Information and Communication Technologies

IS Information Systems

IT Information Technology

KB Knowledge Base

KM, VP of Vice President of Knowledge Management

LDA Latent Dirichlet Allocation

M&As Mergers and Acquisitions

MIMIC Multiple Indicators and Multiple Causes

NER Named Entity Recognition

NIR Near-Infrared Ray

NLP Natural Language Processing

NLTK Natural Language Toolkit

PLS Partial Least Squares

SEDE Stack Exchange Data Explorer

SEO Search Engine Optimization


SO Stack Overflow

SOA Service-Oriented Architecture

SPI Software Process Improvement

SQL Structured Query Language

TSC Transfer Spectral Clustering

URL Uniform Resource Locator

UTF Unicode Transformation Format

UX User Experience

VSD Value-Sensitive-Design


Chapter 1: Introduction

1.1 Background

Over the past two decades, economies have shifted from being production-based to being knowledge-based, and knowledge management software and services have become multi-billion-dollar industries (Kim, Song and Jones 2011; Roberts 2000; Sarker et al. 2005; Schaufeld 2015). A report from the National Science Board highlights the importance of knowledge transfer in the creation of successful patents and active licensing deals and the major influence knowledge-based transfer has had on invention and innovation (National Science Foundation 2017). Conversely, as Babcock (2004) observed, failed knowledge transfer1 can translate into loss of property and life.

A professional knowledge worker is an individual who researches, produces, analyzes, and interprets information and ideas. Knowledge transfer is the dynamic, unnatural2 practice of transferring knowledge between people and departments, usually trying to organize, create, capture, and distribute intellectual content within an organization (National Science Foundation 2017). Knowledge transfer differs from traditional learning or education in that it is not usually conducted in a hierarchical fashion.

Knowledge transfer is particularly inefficient for software development teams (Raytheon Professional Services LLC 2012). Many developers rely on Internet resources and online forums rather than on other members of their teams to obtain answers to their technical questions or concerns. Stack Overflow attempts to streamline knowledge transfer between technical users by offering a collaboration platform for addressing programming problems.

1 Both in terms of absorption and transmission.

2 Knowledge acquisition and sharing is time-consuming and usually discouraged by management. Besides, knowledge hoarding is correlated with job security (Davenport 1997).

Stack Overflow is the world's largest and most popular Q&A community (online forum) for technical users. It serves over 50 million computer programmers and software developers worldwide, 60% of whom work on back-end projects (Zhang 2018). On average, each developer accesses the site six times a month, spending at least fifteen seconds per visit. Additionally, Stack Overflow is the largest and most popular community within Stack Exchange, having the highest number of participants and posts. As of June 2019, Stack Overflow had 11 million users and stored 18 million questions and 27 million answers. Questions currently range over a wide variety of technical topics, such as programming languages, algorithms, APIs, disruptive technologies, troubleshooting, configuration, and technical definitions. The success of the Stack Overflow platform depends on users' willingness to collaborate by asking and answering each other's questions (Calefato 2018). In terms of user participation, Stack Overflow is an online knowledge production site controlled by small groups of contributors3 (Matei 2018). Oliveira (2018) classifies users as experts4 or activists.5 Over 94% of participants contribute infrequently, and highly engaged contributors are extremely uncommon (Oliveira 2018).

3 Also known as elite stickiness.

4 Having low participation with high-quality posts.

5 Having high participation with low-quality posts.

Stack Overflow is an open community in which users have a say in how the site behaves, have access to the data from Stack Exchange, and experience a sense of ownership over the platform. This accomplishes both network and feedback effects across the entire community. Chua (2015) defines Stack Overflow as a knowledge-sharing, community-owned online platform that fits the profile of a Community Question Answering (CQA) site harnessing collective wisdom. Additionally, Stack Overflow fits under the definition of Computer-Supported Cooperative Work (CSCW).

Stack Overflow adheres to a sequential process. First, registered users post a question (see Figure 1-1). At this juncture, questioners can enter a title, a description, tags,6 and code snippets regarding the problem they are trying to solve. Other registered users then start offering solutions to the question. The community can upvote/praise, downvote/criticize, or comment on each answer, deciding which answer gets the highest score. Downvotes and upvotes directly establish the reputation of a given user; i.e., the more upvotes a user gets, the better the user's reputation within the community. Furthermore, only the questioner can choose a given answer as the preferred one; the answer then becomes accepted and the question thereby resolved (see Figure 1-2).

6 Buzzwords identifying the subject matter of the inquiry.


Figure 1-1. Stack Overflow question.

Figure 1-2. Stack Overflow answer.
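To make this sequence concrete, the following minimal Python sketch models the posting, voting, and acceptance steps described above. It is illustrative only: the class and method names are hypothetical and do not correspond to Stack Overflow's actual implementation or API.

```python
# Illustrative model of the question/answer/vote/accept flow described above.
# All names are hypothetical; this is not Stack Overflow's actual code or API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Answer:
    author: str
    body: str
    score: int = 0  # upvotes minus downvotes

@dataclass
class Question:
    asker: str
    title: str
    tags: List[str]
    answers: List[Answer] = field(default_factory=list)
    accepted: Optional[Answer] = None  # set only by the asker

    def upvote(self, answer: Answer) -> None:
        answer.score += 1  # also raises the answerer's reputation

    def downvote(self, answer: Answer) -> None:
        answer.score -= 1

    def accept(self, answer: Answer) -> None:
        self.accepted = answer  # at most one accepted answer per question

    def community_best(self) -> Optional[Answer]:
        # The community's choice is simply the highest-scoring answer.
        return max(self.answers, key=lambda a: a.score, default=None)
```

Note that community_best() and accepted need not coincide; that distinction underlies the notion of an optimal answer introduced in Section 1.5.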


Successful questions (i.e., those that are answered) have certain attributes: They are posed by reputable users, are short and clear, contain code snippets, do not abuse uppercase characters, and adopt a neutral7 emotional style (Calefato 2018). To determine the answerability of a question, researchers use a framework with the following features: affect,8 presentation quality,9 post time, and reputation (Calefato 2018). Approximately 29% of questions on Stack Overflow do not receive answers that are accepted,10 even though many of them have one or more proposed solutions.
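As a concrete illustration of the presentation-quality features mentioned in footnotes 8 and 9, the following Python sketch computes URL presence and uppercase usage for a post body; the 0.3 cutoff is an arbitrary placeholder of mine, not a value from the literature.

```python
# Illustrative computation of "presentation quality" features (footnotes 8-9):
# URL presence and uppercase usage. The cutoff below is an assumption.
import re

def presentation_features(body: str) -> dict:
    letters = [c for c in body if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    return {
        "has_url": bool(re.search(r"https?://\S+", body)),
        "uppercase_ratio": upper_ratio,
        "abuses_uppercase": upper_ratio > 0.3,  # arbitrary illustrative cutoff
    }

print(presentation_features("HOW do I fix this?? See https://example.com"))
```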

1.2 Research Motivation

Stack Overflow is one of the most popular online forums among software developers and programming enthusiasts. That fact, along with the naturally complex nature of computer systems and the pervasiveness of programming bugs and hardware failures, is the primary motivation for this praxis.

Finding an adequate solution to a programming problem is not a trivial issue. Stack Overflow has limitations that can be overcome through better matching between questioners and accurate answers. Determining what factors make a reliable, valid, and high-scoring answer can have two major benefits. Firstly, Stack Overflow's user interface can be redesigned to encourage users to tailor their solutions and meet specific standards. Secondly, askers would have their concerns addressed more quickly and efficiently.

7 Stack Overflow is not a discussion forum.

8 Using a method known as sentiment analysis or opinion mining. SentiStrength and Voyant are two popular sentiment analysis tools.

9 Considering the presence of URLs and the number of uppercase characters.

10 Collected from https://StackExchange.com/sites, after clicking on the Stack Overflow panel.

Stack Overflow can be a driving force for transferring knowledge within development teams while increasing job retention and bolstering code quality. This praxis seeks to optimize answer searchability and improve knowledge transfer among software developers in Stack Overflow.

1.3 Problem Statement

Twenty-nine percent of questions posted to Stack Overflow are never answered, and the accepted answers have a low average score,11 barely reaching a value of two. Low scores tend to overshadow the quality of a solution and, therefore, prevent egalitarian (i.e., evenly distributed) information diffusion within the site and hinder knowledge transfer among software developers.

In general, the high volume of information in a knowledge base (Wang 2010) deters knowledge transfer, causing IT personnel to waste an average of 265 man-hours (10-15%) annually (National Science Foundation 2017). Even if not all this time is spent browsing online forums or searching for answers in Stack Overflow, IT employees struggle to find reliable solutions to their pressing problems (Jacobs 2018).

11 See Appendix B, Table B-1, Query #1. Source: https://Data.StackExchange.com/stackoverflow/query/edit/988631. The average score is calculated as follows: $\bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_i$, where $s_i$ is the score of the $i$-th accepted answer and $N$ is the total number of accepted answers. Even though there is considerable score variability from post to post, high-scoring answers tend to have scores in the low hundreds.
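For illustration, assuming accepted-answer scores have been exported from SEDE to a CSV file with a Score column (the file and column names are assumptions about the export, not requirements of SEDE), the average can be computed as follows:

```python
# Minimal sketch: average score of accepted answers from a SEDE CSV export.
# The file name and the "Score" column are assumptions about the export format.
import csv

def average_accepted_score(path: str) -> float:
    with open(path, newline="", encoding="utf-8") as f:
        scores = [int(row["Score"]) for row in csv.DictReader(f)]
    return sum(scores) / len(scores) if scores else 0.0

print(average_accepted_score("accepted_answers.csv"))  # reportedly ~2 (Query #1)
```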


Similarly, according to the Panopto Workplace Knowledge and Productivity Report, (a) large companies can lose up to $47 million in productivity each year, (b) knowledge professionals can waste 5.3 hours weekly, and (c) 81% of workers feel frustrated when they cannot get the information needed to perform their jobs. Also, 85% of employees claim that knowledge transfer positively impacts productivity12 (Jacobs 2018). Fortune 500 companies report losing over $31.5 billion a year, despite the huge investments they make in knowledge transfer programs (Babcock 2004). Indeed, American companies invest over $100 billion in training, but less than 10% of this is invested in work-setting-oriented knowledge transfer support (Benabou and Benabo 1999). In general, common barriers blocking knowledge transfer are the loss of job security, not wanting to be the bearer of bad news, distrust of coworkers and managers, fear of being ridiculed, and interference with one's duties. The last two are the main inhibitors of online collaboration among developers.

IT departments rank third—behind customer service and sales—in terms of turnover attributable to limited knowledge transfer, which occurs predominantly during onboarding. Onboarding13 costs are extremely high. Raytheon has reported that the mean cost of a single job turnover approaches $115,000, and it takes over 100 days on average to hire a software developer.

Additionally, only 39% of executives leading training programs believe their organizations are either effective or somewhat effective at transferring knowledge. Knowledge transfer programs often lack consistency, organizational alignment, time, budget, tools, and executive support (Raytheon Professional Services LLC 2012).

12 Productivity is defined as completing a project within a fixed time frame (Sarker et al. 2005).

13 In this context, onboarding refers to the training processes new employees have to go through.

Babcock (2004, 2) asserted, "Some of the consulting firms' systems got too much knowledge put in them, and people got overwhelmed by the amount of [information] they had to deal with." The massive amount of technical documentation hinders its usage, causing IT teams to stop reading training documents altogether. A famous Latin proverb reads, "Aurum Vergilius de stercore colligit Enni" (Vergil must gather gold from Ennius's manure). When trying to find useful information, most IT professionals feel the same way.

1.4 Thesis Statement

The development of a model that identifies factors leading to a high-scoring answer using natural language processing (NLP) and regression analysis will streamline current searching capabilities in Stack Overflow and facilitate knowledge transfer by increasing developers' chances of finding the answers they need.

Proper communication and documentation keep engineering teams more focused and productive. According to DeMeyer (1991, 49), "The productivity of an R&D engineer depends to a large extent on his or her ability to tap into an appropriate network of information flows," decreasing the number of software bugs and facilitating deployments. For instance, knowledge professionals assisting large teams spend more than 13% of their time dealing with inefficient knowledge management processes (Raytheon Professional Services LLC 2012). Indeed, the more data are available, the more complicated and expensive the logistical challenges of handling and managing those data become (Koman and Kundrikova 2016; Nowlan and Blake 2007). Moreover, a substantial portion of all available documentation is inaccessible or simply lost in the clutter created by excessive data availability.

Compounding matters, the increase of mergers and acquisitions (M&As) within the IT industry and the resulting growth in the size of software-based companies point toward a rapid industry concentration of knowledge and market heterogeneity, a trend that complicates the integration of knowledge repositories (Kalpic 2008). Indeed, exchanging written documents providing "visual anonymity and asynchronous interaction" entails arduous relationships and dealing with relational differences (Szulanski, Ringov and Jensen 2016, 309).

Even though Stack Overflow has a growing repository currently storing millions of posts, it uses formalized standards to avoid data duplication and to remove irrelevant content. By using Stack Overflow to document solutions, engineering teams could cross-train their members and create a cloud-based back-up to store company knowledge (Cancialosi 2017). Stack Overflow Teams was created with a specific purpose—to offer an online repository with private access. Private, cloud-based repositories ensure that knowledge professionals can delegate training to individual departments for greater efficiency. Instead of relying on textual documentation that requires time-consuming and expensive processing, teams can also use multimedia-based, interactive tutorials to make knowledge transfer programs more entertaining and appealing. However, no software initiative can succeed without the backing of a formal process for managing and transferring knowledge. Hence, training must be embedded as a part of the engineering department's culture and mission (Cancialosi 2017), and the chief information officer should be held responsible for the training's success. By using Stack Overflow to create a knowledge repository, companies can transfer knowledge more effectively, decrease the amount of time invested in training, lower the number of software bugs, and complete projects in less time.

1.5 Research Objectives

The main objective of this research is the prediction of what constitutes an optimal answer in Stack Overflow. By considering the askers' and community's feedback, along with the answerers' reputations and other metadata from a thread's post, the proposed research product is intended to forecast whether an answer is optimal. The research questions posed relate to the average scores of different answer types and to the variability in answerers' reputations and geographical locations in determining how accurate an answer is. Other factors considered were the number of views, word count, relevance, and user reviews.

The best or optimal answer is defined as one that is selected by both the community and the asker independently. Indeed, only a small percentage of solutions are optimal. For example, if a given question received nine proposed solutions, the one with the highest score is the best answer according to the community. However unlikely, there might be multiple answers with the same score; nonetheless, only a single solution14 can be selected by the asker. To clarify, a working solution might not be the best one, as it could contain performance issues or might have missed special scenarios, such as boundary values or states.

14 It may contain multiple steps or stages, or different approaches for solving the same problem.

Both askers and the community can ignore answers or accept and rank solutions.

As depicted by Figure 1-3, an optimal community answer and an asker’s preferred answer need not be the same. Unanswered questions are not part of Figure 1-3, which simply intends to represent the optimal answer region in Stack Overflow. The optimal region is hereby defined as the overlap between the best answer according to the community (highest score) and the best answer according to the asker (personal preference).

[Figure 1-3 is a Venn diagram: within the set of all solutions, the optimal community answer (highest score) and the optimal asker answer (personal preference) overlap in the optimal region.]

Figure 1-3. Stack Overflow and optimal answer region.
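Continuing the illustrative sketch from Section 1.1 (hypothetical names, not an actual API), membership in the optimal region reduces to a single check: the asker's accepted answer must also be the community's highest-scoring one.

```python
# Continues the hypothetical Question/Answer sketch from Section 1.1.
def is_optimal(question) -> bool:
    best = question.community_best()  # community's choice (highest score)
    return best is not None and question.accepted is best
```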


As a collateral objective, the present study seeks to optimize the way IT teams handle knowledge transfer processes via Stack Overflow. The proposed research product outlines a framework to predict high-scoring, relevant answers in Stack Overflow.

Achieving such an objective decreases the number of unused documents in internal knowledge bases and does away with long, useless manuals.

1.6 Research Questions and Hypotheses

This research answers instrumental research questions, as follows:

RQ1: What is the average score of accepted answers versus rejected ones? (comparative)

RQ2: What is the standard deviation of the authors' (source) reputations? (descriptive)

RQ3: How frequently do question tags appear in the body of a chosen answer? (descriptive)

RQ4: What is the average number of comments per question and per answer? (comparative)

RQ5: What is the range in the number of suggested edits per chosen answer? (descriptive)

The research hypotheses are tested via a methodology applied to parameters of Stack Exchange usage, as follows:


RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (has fewer than 2,000 characters), then the answer will have a high relevance score.

RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the answer will more likely be selected as the asker's preferred answer, thus becoming easier to find.

RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow's high-scoring topics will be easier to find (search optimization).

The first research hypothesis will analyze community-driven, high-scoring answers to identify the factors driving relevance scores. For instance, it is hypothesized that posts with more than six comments will likely have a high score. The population's average number of comments is three; hence, answers containing at least twice that number of comments (i.e., more than six) will meet the specified criterion. Additionally, high-quality posts will have fewer recommended edits than other posts within the thread and contain fewer than 2,000 characters.

Similarly, the second research hypothesis identifies the factors leading askers to select a given answer as their preferred one. The average cumulative tag score and bag-of-words score are 5,054,237 and 50, respectively. Therefore, scores higher than 10,108,474 and 100, respectively, would be accepted.
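To make these screening criteria concrete, the following Python sketch encodes the RH1 and RH2 thresholds stated above, plus the high-relevance cutoff used for RH3. The feature names are hypothetical, and the cutoffs are simply twice the reported averages (or, for relevance, the score-greater-than-five rule).

```python
# Illustrative encoding of the RH1/RH2/RH3 screening criteria. Field names
# are hypothetical; thresholds come from the averages reported in Section 1.6.
from dataclasses import dataclass

@dataclass
class AnswerFeatures:
    comment_count: int        # comments on the parent question
    suggested_edits: int      # suggested edits on this answer
    min_edits_in_thread: int  # fewest suggested edits among sibling answers
    char_count: int           # length of the answer body
    cumulative_tag_score: int
    bow_score: int            # bag-of-words frequency score
    score: int                # answer score (upvotes minus downvotes)

def meets_rh1(a: AnswerFeatures) -> bool:
    # RH1: more than six comments (twice the average of three), the fewest
    # suggested edits in the thread, and fewer than 2,000 characters.
    return (a.comment_count > 6
            and a.suggested_edits <= a.min_edits_in_thread
            and a.char_count < 2000)

def meets_rh2(a: AnswerFeatures) -> bool:
    # RH2: cumulative tag score above 10,108,474 (twice the average of
    # 5,054,237) and bag-of-words score above 100 (twice the average of 50).
    return a.cumulative_tag_score > 10_108_474 and a.bow_score > 100

def high_relevance(a: AnswerFeatures) -> bool:
    # With an average answer score of two, any score greater than five
    # is treated as a high relevance score (used in RH3).
    return a.score > 5
```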


The third hypothesis will utilize the factors identified in hypotheses 1 and 2 to build a conceptual framework and locate Stack Overflow answers faster. As mentioned previously, answers have an average score of two. Any answer with a score greater than five will be considered to have a high relevance score.

1.7 Scope of Research

The scope of the present research is twofold. First, it defines and manually tests a research product that could allow future researchers to understand the factors leading to high-scoring answers in Stack Overflow. Second, it recommends the implementation of the proposed research product in a tool (e.g., Python or C-sharp) supporting text analytics and statistical analysis. In this way, thousands of posts can be analyzed in a matter of seconds.

1.8 Research Limitations

This research is limited in several ways. First, the data for the praxis were almost entirely restricted to the year 2018. Stack Overflow's maturity exceeds eleven years,15 and the quantity and quality of posts from the years following its inception might be vastly different from the ones used here. Using a dataset from a different time period would result in significantly different outcomes, considering that Stack Overflow was not as popular and/or influential in 2008 as it is today.

15 Stack Overflow was founded in 2008 by Jeff Atwood and Joel Spolsky.


Second, since Stack Overflow is a highly technical forum that does not favor social interactions, this praxis's conclusions will not be applicable to online forums at large, nor to junior or retired programmers and non-college graduates. Having the goal of getting better, more reliable results, the site expects focused answers to problems that prevent developers from doing their jobs.16 Similarly, strict rules are enforced with a fair level of unspoken tension, making mid- and senior-level developers—and not beginners—Stack Overflow's target audience. A related limitation is users' bias and fallibility, as individual askers or even the community can select a suboptimal answer based on the way it is written (e.g., it is entertaining or easy to understand).

16 Also known as an impediment or blockage in the Agile/Scrum methodology.

Third, the present praxis develops a framework that Stack Overflow’s users or administrators/moderators would use; however, it does not include specific implementation details of the software that would automate the process described in the framework.

1.9 Organization of Praxis

The current chapter introduces knowledge transfer and its limitations. It also explores Stack Overflow's main features and intended audience. The second chapter offers a literature review covering knowledge transfer, defining the concept of information, describing online forums, and analyzing how methods used in related research informed the present praxis. The third chapter explores the methodological approach used—which involves text analytics, partial least squares (PLS), and binary logistic regression (BLR)—and explains how the data were collected and sanitized. Chapter four summarizes the results of the research, and chapter five enumerates the conclusions, contributions, and recommendations of the praxis. Sources and appendices are included at the end of the paper.


Chapter 2: Literature Review

2.1 Introduction

The definitions of information, knowledge, and knowledge absorption and transfer have been developed extensively in the literature. Such concepts are pivotal to understanding online forums and Q&A communities, including those catering to technical users. Stack Overflow's popularity is heavily dependent on information flows and trust between users.

Additionally, the platform's user experience (UX) relies not only on availability and usability but also on the successful implementation of a gamification17 paradigm. The use of reputational scores, badges, and bounties drives answerers to compete against each other. A high reputation unlocks new privileges and grants access to moderation tools; badges are considered special achievements and are issued at three levels: bronze, silver, and gold; and bounties tend to redirect traffic to questions with high visibility, making them look like missions.18 Therefore, the user offering the highest-scoring answer is considered to be the thread's winner, thereby garnering an elevated standing within the community. Stack Overflow's Leagues display a scoreboard with the highest-ranked users by week, month, quarter, year, and all time.19

17 Gamification is the use of elements and techniques usually found in video games to make a product more appealing and engaging.

18 Sources: https://StackOverflow.com/tour and https://www.YouTube.com/channel/UC2hxYQtGLEkcOMK4h8JRycA/videos.

19 Source: https://StackExchange.com/leagues/1/alltime/stackoverflow. The highest-ranking Stack Overflow user is Jon Skeet, a Java and C-sharp software developer for Google, considered an authority on date and time algorithms. He has become almost a mythical figure within the Stack Overflow community, as all of his [almost 40,000] answers are upvoted and over 200 million visitors have seen his solutions, saving an estimated 56.8 million hours of development time. He was the first person to win over one million reputation points in Stack Overflow.

No research can be successful in determining practical ways the forum can be improved without a full understanding of the type of information Stack Overflow users expect and the format that it follows. Moreover, it is important to assert that Stack Overflow's true contribution is its ability to offer answers to complex coding questions and contribute to the success of programming projects. Software developers see the benefit in data availability but value data readiness even more.

2.2 Information, Knowledge, and Related Concepts

Information can be defined as a set of organized data patterns, but knowledge is intrinsically more permanent, a public good, non-excludable, multilayered, and often indivisible (Bock and Kim 2002; DeMeyer 1991; Roberts 2000). Information is made up of facts and symbols, whereas knowledge involves skills and expertise; hence, knowledge is interpreted, personalized, or contextualized information (Zhang 2007).

Knowledge modes are multi-dimensional and are defined as tacit, embodied,20 encoded,21 embrained,22 procedural,23 and embedded24 (Joshi, Sarker and Sarker 2007). Furthermore, knowledge processes involve the discovery, capture, sharing, and application of information (Alawneh 2016) for improving the success of IT projects, online collaboration, and the mitigation of risks.

20 Partially articulated.

21 Stored in data banks.

22 Ability to interpret underlying phenomenological patterns.

23 Process-based understanding.

24 Contextual, not pre-given.

Tacit25 knowledge represents an intuitive understanding grounded in expertise that is not only harder to articulate than explicit, formal, systematic, codified, or structured knowledge but also cannot be stored in relational databases. Tacit knowledge can be divided further into collective and overlapping/specific knowledge. Tacit knowledge, as opposed to articulated knowledge, is informal. It is shared via person-to-person exchanges and spreads locally and across networks of people interested in similar topics26 (see Figure 2-127; Jacobs 2018). Indeed, Stack Overflow relies on the ability of users to translate implicit knowledge into explicit knowledge by allowing answerers to articulate solutions that can be stored and later evaluated by the community.

25 Unstructured and actionable knowledge are subsets of tacit knowledge.

26 The interest graph is a digital representation of a given user's interests. Stack Overflow, as opposed to a social networking site, connects people via interest-links.

27 Source: https://Upload.WikiMedia.org/wikipedia/commons/2/23/Interest_graph_vs_social_graph.png. Labeled for reuse.

Figure 2-1. Interest graph.

Knowledge diffusion relates to knowledge shareability and is positively correlated with codifiability. For instance, confidential information tends to be undiffused. Knowledge diversity, however, is uncorrelated with shareable knowledge (Slaughter and Kirsch 2006). Even when some tacit knowledge cannot be codified and becomes ambiguous, it can still be shared both internally and externally through actions and practices (Ihrig and MacMillan 2015; Szulanski, Ringov and Jensen 2016). Other categories of knowledge are know-why,28 know-how,29 know-who,30 and know-what31 (Roberts 2000; Santhanam, Seligman and Kang 2007).

28 Causal and contextual.

29 Procedural.

30 Selective social relations.

31 Declarative and factual.


Information systems projects are "collective, collaborative, complex, knowledge-intensive, and creative efforts" (Alawneh 2016, 1). Information technology (IT) is a subset of information systems (IS). Even though "one of the most important factors contributing to [IT] success is communication and documentation" (DeSimone et al. 1995, 17), one can argue that knowledge repositories can grow so quickly that they render themselves obsolete. Having a predictive framework that retrieves a good answer in a timely manner could be extremely useful to Stack Overflow's effectiveness and outreach.

2.3 Digging into Stack Overflow

Developers are usually redirected to Stack Overflow from Google's top search results, given that Stack Overflow is regarded as a "code-centric knowledge base" (Yang 2016, 225). Stack Overflow also facilitates the process of sharing experiences and expertise in the area of computer engineering.

Stack Overflow is a crowdsourcing platform for knowledge transfer among software developers. Programmers use Stack Overflow as a means to gain knowledge specifically related to “programming languages, API use, configuration management, web frameworks, and web browsers” (Abdalkareem 2017, 55); it allows the community to document bugs and ongoing issues. Some recommendations to improve effectiveness within the site are to “provide direct feedback, assess code snippet quality, and link changes to discussions” (Abdalkareem 2017, 57). Oliveira (2018) proposed connecting users as part of the community and promoting bond-based relationships.


Vasilescu (2013) claimed participation in Stack Overflow does not interrupt developers' working rhythm; nor does it slow productivity. Highly productive GitHub committers tend to adopt a more professorial attitude toward Stack Overflow's forums and transfer their knowledge in a more egalitarian way via micro-, intermediate-, and macro-analysis. Vasilescu's paper informed this research in terms of linking Stack Overflow to other platforms, but it did not address the quality of answers posted or the relation of this quality to effective knowledge transfer.

Previous research has been conducted in Stack Overflow to improve the maintainability of the site and optimize its responsiveness. Zhang et al. (2015) outlined how duplicate questions make site maintenance harder, and their solution—DupPredictor32—outperformed Stack Overflow's search engine by 40.63% (Zhang et al. 2015).

Chua (2015) used metadata33 and content34 for a predictive framework to determine how a high number of downvotes, a low number of tags, and a small number of characters within a question would increase the question’s likelihood of being answered. Yang (2016) analyzed code snippets in Python, JavaScript, Java, and C-sharp to determine answers’ correctness and completeness.

32 DupPredictor uses Latent Dirichlet Allocation (LDA) to extract topic distributions; Porter stemming to reduce words to their root form; and WVTool to remove stop words.

33 Popularity, participation, asking time, derived role, and derived popularity.

34 Level of detail, specificity, accuracy, clarity, and socio-emotional value.
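The preprocessing steps named in footnote 32 (stop-word removal and Porter stemming) are standard text-mining operations. A minimal sketch using Python's NLTK, one of several possible toolkits (DupPredictor itself used WVTool for stop words), might look like the following:

```python
# Minimal sketch of the preprocessing steps named in footnote 32:
# stop-word removal and Porter stemming (here via NLTK rather than WVTool).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def preprocess(text: str) -> list:
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

print(preprocess("How do I parse dates in Python?"))
# e.g., ['pars', 'date', 'python']
```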


Oliveira (2018) has argued that Stack Overflow exhibits a participation imbalance between United States users (individualistic, contributors, active) and Asian users (collectivist, lurkers, passive) via a value-sensitive-design (VSD), tripartite methodology. Additionally, Oliveira (2018) has asserted that the values (such as productivity and reputation) showcased in Stack Overflow better align with a Western, individualistic perspective. Stack Overflow can be considered a structural resource, as the site and its users establish a dynamic feedback loop.

Zhang (2018) proposed a technique named RASH that, via lexical and historical analysis of application programming interfaces (APIs), maps APIs with 70% accuracy and accelerates the resolution of questions and the saving of developer’s time. This is relevant because, when compared to other question types, API-related questions take three days longer for answers yet are viewed 200% more, according to Zhang (2018).

Ninety-two percent of Stack Overflow's questions are answered, with a wait time averaging 11 minutes for the first answer and 24 days to get a solution or accepted answer (Masudur-Rahman 2018). Masudur-Rahman (2018) created a prediction model to detect the best answer (defined as the most upvoted) within an unanswered question by using five metrics (answer rejection rate, last access delay, topic entropy, reputation, and vote), to which he applied four different methods: lexical (readability), semantic (topic similarity), user behavior, and popularity. However, Masudur-Rahman (2018) did not apply his findings to already-answered questions.


2.4 Knowledge Transfer

Knowledge transfer means getting the right information to the right people at the right time (Cancialosi 2017). There are three epistemological stances considered for knowledge transfer: cognitivistic,35 connectionistic,36 and autopoietic37 (Joshi, Sarker and Sarker 2007).

35 Fixed, stored data.

36 Knowledge transfer resembles interconnected networks.

37 Self-produced, autonomous, co-evolved, unified, and unshareable.

Indeed, knowledge transfer has been a cornerstone of social and technological advances throughout history. For instance, the printing press was a major improvement over oral tradition for transferring knowledge. Some classical works, such as The Aeneid,38 The New Testament,39 and The Divine Comedy,40 garnered more popularity simply by having the right language channel. By selecting the right mode of communication, knowledge transfer becomes easier to manage and ultimately more effective.

38 Written in Latin.

39 Written in Greek.

40 Written in Italian.

Nonetheless, technology products are different from works of art, as they are often the outcome of substantial embedded/internalized tacit knowledge. The present praxis argues against the widely accepted idea that mostly tacit (as opposed to explicit) knowledge creates a strategic advantage by defying the economic law of scarcity. Also, it challenges the perception that a technology's popularity makes it more accessible or easier to understand.

In Laudon and Laudon (2006), Procter & Gamble (P&G) solves the problem of its annual growth rate slowdown from 5% to 2.6% via knowledge management-driven initiatives and information systems. The case study describes how P&G implemented InnovationNet for bolstering collaboration, enabling better surveys, removing silos, implementing agent-based modeling and simulations, and triggering synergy across teams. The authors observed that software like MatrixOne was used to automate and standardize knowledge components.

Most large companies understand the difficulty of transferring knowledge from experts to novices. Employees may feel overwhelmed by the amount of information they are expected to master and discouraged by the lack of formal processes to guide them. Changchit (2003) proposed an intelligent and interactive system that measures perceived usefulness and ease of use to achieve such a goal. Besides, good knowledge transfer practice and unconscious learning of managerial systems, norms, and values take employees to an expert level via pattern recognition (Swap et al. 2001). Nonetheless, formal knowledge transfer processes can stifle disruptive innovation models.

Some solutions have been proposed regarding knowledge transfer optimization; however, none of them deal with the issue of documentation removal, peer review, or updating. Alawneh (2016) proposed a framework for dealing with experts' diversity when preparing a memory repository. Geiger (1994) suggested an evaluation technique for tacit knowledge. Dillon, Graham, and Aidells (1972) analyzed delivery methods41 for both individuals and groups. DeMeyer (1991), Roberts (2000), and Sarker et al. (2005) elaborated on the ability of information and communication technologies (ICTs) to enhance the transferability of knowledge, even though they acknowledged the relevance of face-to-face interactions vis-à-vis relational proximity or localness of knowledge.

Iyengar, Sweeney, and Montealegre (2015) used a MIMIC (multiple indicators and multiple causes) model, which applies statistical validation techniques to a set of formative indicators measuring internal IT use, knowledge transfer effectiveness, absorptive capacity, and financial growth. Slaughter and Kirsch (2006) analyzed the impact of knowledge transfer on software process improvement (SPI) initiatives in terms of composition (types of mechanisms) and intensity (how frequently mechanisms are used), converging into accumulative techniques.

Cancialosi (2017) recommended six steps42 for achieving effective knowledge transfer. Guo (2016) developed a joint learning framework that combines transfer43 and active44 learning, considering the different source and target domains for sample labeling of seed parameters, cross-class knowledge transfer, and statistical classifiers. Zero-shot learning uses models without labeled data45 (Guo 2016). Rohrback (2011) pointed out that traditional multi-class classification and error-prone zero-shot learning methods tend to use a limited number of object classes.

41 Video and practice.

42 Make it formal (1), create duplication (2), train, train, train (3), use systems (4), create opportunities (5), and be smart when using consultants (6).

43 System-driven.

44 Human-driven.

45 Unlabeled data can lead to information loss.

Akshatha (2017) studied knowledge transfer's impact on job satisfaction and employee turnover. However, females, top management, and employees with five or more years of experience in their roles are underrepresented in the study. According to the United States Department of Labor, workers between the ages of 55 and 64 have three times the tenure46 and job security of workers between the ages of 25 and 34 (Bureau of Labor Statistics, United States Department of Labor 2012). Furthermore, the lack of educational background in non-professional jobs tends to influence tenure negatively. Computer and engineering occupations outperform most others in terms of tenure, except for careers with the federal government.

46 Amount of time a person has worked for a given employer.

The following research papers have potential applicability to the field of knowledge management and highlight important techniques for transferring knowledge in online forums. For example, He (2018) and Nowlan (2007) used intelligent agents47 for faster decision-making via collaborative filtering48 as a solution to data sparsity.49 Wang (2010) explored semantic-oriented knowledge transfer for review rating.50 Shahbandi (2018) optimized error-prone mapping algorithms in the field of robotics. Jiang (2018) recommended a novel algorithm for auxiliary textual datasets called transfer spectral clustering (TSC). The research outlined above deals with the optimization of text-based knowledge, and a few of the techniques presented, such as collaborative filtering, were partially used for this paper's data analysis.

47 Human proxies containing a brain, body, society, and human-agent interaction.

48 Collaborative filtering can be narrow or general, being a technique applied by recommender systems.

49 This concept is very common in natural language processing, and it relates to the problem of not observing sufficient corpus data to accurately model a given language. It is also known as data sparseness or paucity.

50 Transfer rating.

On top of optimization, one can see how trust, reputation, and culture all have a considerable influence on knowledge management. "Knowledge tends to move horizontally and vertically along structural lines" (Slaughter and Kirsch 2006, 305), and it drives IS profitability as knowledge resources become scarce and hard to purchase (Zhang 2007). As recommended by Gist (1989) three decades ago, behavioral modeling51 can be beneficial for knowledge transfer.

51 Process demonstrating the behaviors driving performance and efficacy.

Simon et al. (1996) compared traditional and non-traditional computer training techniques in a military setting and considered trainees' cognitive abilities, motivations, and training environment. For Simon et al. (1996), knowledge was a process and not a product. Amiryany (2014) researched knowledge-based acquisition failures, especially within high-tech companies, dividing them into formal acquisition structures, communication tools and practices, and on-the-job learning activities. Chang (2012) examined the positive impact52 that IT-related software outsourcing has on technical knowledge (Chang and Gurbaxani 2012), economies of specialization, and accumulated knowledge, correlating transfer with intensity, leverage, size, and consistency (Chang and Gurbaxani 2012).

52 Up to 6% gains.

Nevertheless, companies usually seek outsourcing solutions due to scarce resident capabilities; hence, expected benefits might be overestimated, and tacit/implicit knowledge transfer could be nonexistent. Knowledge transfer improves due to the external nature of the provider, and wage benefits—not knowledge transfer optimization—drive business decisions. More importantly, companies often do not consider online forums, such as Stack Overflow, to manage their knowledge repositories due to security and legal concerns. Stack Overflow can indeed manage private knowledge bases for private and public companies alike.

A segment of the research literature is oriented toward profitability. Ihrig and MacMillan (2015) introduced a two-dimensional map going from tacit to explicit53 and from proprietary to widespread.54 The goal of this Cartesian model was to shift data points toward the top-right corner, with the primary goals being to map knowledge assets,55 assemble multi-functional teams within diverse unit levels, identify new opportunities—such as licensing—and contextualize and re-discover knowledge for new applications.

53 Y axis: going from unstructured to structured.

54 X axis: going from undiffused to diffused.

55 Knowledge assets can be hard/technical or soft/managerial.

One of the main barriers to knowledge transfer adoption relates to non-codified data. It is estimated that 90% of all stored data is unstructured. Furthermore, with the ubiquity of digital communication and the prevalence of an aging workforce (Benabou and Benabo 1999) comes an increase in the number of knowledge repositories, the maintenance of which becomes harder due to "data explosion and information overload" (Delen and Al-Hawamdeh 2009). Far from resolving the problem, new technologies—data mining, data warehousing, web crawling, and cheap storage—have made it more pervasive. For instance, Koman and Kundrikova (2016) claimed that technological development triggers constant information production and pointed out that one-fourth of managers are confused by the concept of big data. Good knowledge management—core competencies, areas of expertise, intellectual property, and deep pools of talent (Ihrig and MacMillan 2015; Khandelwal and Gottschalk 2003)—must avoid the big data trap.

There are several proposed solutions in the literature, but only a fraction of them deal with the issue of massive knowledge repositories. Moreover, there are even fewer solutions that take into account the value of online repositories. As pointed out by Simon (1971) almost 50 years ago, "A wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it" (Simon 1971, 40-41).

Delen and Al-Hawamdeh (2009) developed a holistic knowledge management framework within a federated system that addresses a lack of emphasis on textual information. Their methodology was to harvest tacit knowledge via interview sessions, mentoring, [memorable, truthful, and positive] story-telling, analogies, and metaphors56 (Swap et al. 2001). The researchers' sequential, three-step process included extracting data from information sources, consulting know-how materials, and finally generating actionable knowledge. The drawback of such an approach is that knowledge transfer is more efficient within an iterative, network-based, and collaborative environment with feedback effects, not within a step-by-step recipe.

56 Personalization, socialization, and internalization strategies.

Kim, Song, and Jones (2011) explored a social-cognitive selection framework for a knowledge acquisition strategy in virtual communities via goal-setting theories from a demand-and-supply perspective. This research has provided abundant information on acquisition methods within virtual communities; it has not, however, expanded on inter-group knowledge transfer.

Finally, the knowledge engineering paradox states, "[A] more capable individual is unable to successfully explicate his/her knowledge in forms that can be internalized by other less capable team members" (Joshi, Sarker and Sarker 2007, 331), especially as the knowledge gap increases. In fact, high performers may be detrimental to departmental knowledge transfer goals. The reason, according to DeSimone et al. (1995), is that expert innovators work in spurts and reach their highest levels of productivity only rarely. Here, one encounters a conundrum: Content quality decreases as more people contribute to the knowledge base (further discussed in the following sections).

2.5 Online Forums

Even though they are hardly mentioned in developer meetings and are infrequently referenced within internal documentation, online forums are vital to software developers' successes. Q&A sites and technical forums depend on content quality and communal trust. For instance, Attiaoui (2017) used the belief measure of expertise (BME) for the detection of authoritative users; authoritative users are a major drawback within online communities. Specifically, any Stack Overflow user with a reputation greater than 2,400 points is regarded as an expert, without considering the number of accepted answers the user has contributed to the site.

Szulanski, Ringov, and Jensen (2016) claimed that judicious knowledge transfer timing contributes to increased delivery accuracy, regardless of whether a front-loading57 or back-loading58 mode is used. The researchers asserted that knowledge transfer becomes sticky when it is noteworthy. Although foundational to domain knowledge transfer, Szulanski's research mostly ignores the role of online communities.

57 More affordance for tacit knowledge exchange is allocated to the initiation of the transfer than to its implementation.

58 Less affordance for tacit knowledge exchange is allocated to the initiation of the transfer than to its implementation.

Santhanam, Seligman, and Kang (2007) researched knowledge transfer processes and routine work elements during the post-implementation stage. Additionally, they evaluated network effects with end users and colleagues within and between groups. As they pointed out, transactive memory59 can be detrimental to proper knowledge transfer.

59 When person A uses person B as a memory aid.

Joshi, Sarker, and Sarker (2007) explored the impact of knowledge within development teams that support information systems. They found capabilities that do not tend to play a significant role but seek a degree of resonance between senders and receivers. The authors also explored communication theory in terms of messages, senders, receivers, channels, transmissions, and communication effects. Even though the article offers advice in the area of knowledge transfer, it mostly ignores the importance of online forums for IT teams.

According to Sarker et al. (2005, 214), "Not much research has been conducted to examine knowledge transfer within groups in virtual communities, especially those that span time and space (i.e., virtual teams)." However, Sarker et al.'s research could benefit from validation through the use of a comprehensive dataset.

A major problem with online forums is the failure to attract quality answers. Hence, maximizing answerability and answer quality rates would be beneficial for both the platform and the community of users it hosts. By using natural language processing (NLP) for measuring users' engagement, Kowalik (2016) found that elderly60 users of Stack Overflow tend to have higher reputational scores; similarly, seniors61 answer slightly more often than juniors but ask half as many questions.

60 In terms of age and not technical expertise.

61 In terms of technical expertise and not age.

2.6 Summary and Conclusions

After exploring the concepts of information and knowledge, one appreciates their relevance to the area of online forums and specifically to the success of Stack Overflow.

Stack Overflow’s main goal is to transfer knowledge between software developers effectively and improve upon existing techniques used in Q&A sites by applying

60 In terms of age and not technical expertise.

61 In terms of technical expertise and not age.


gamification and other related techniques, such as token economy, for behavior

reinforcement.

Therefore, it is paramount to determine which factors are significant in Stack

Overflow’s posts. Identifying such factors would optimize the site’s responsiveness and

decrease the amount of time users spend finding an answer.

The domain of knowledge management is filled with challenges. Those were considered while reviewing the literature, and some of the shortcomings found are listed below:

1. Who decides who the expert (knowledge originator) is? Most companies look at

data and disregard sources.62 Sources are critical to knowledge acquisition and

can be divided into three categories: dyadic,63 published,64 and grouped65 (Kim,

Song and Jones 2011; Sarker et al. 2005).

2. Some managers adopt big data as a panacea, ignoring that sometimes “less is

more.”

3. Close-mindedness, biases, and overconfidence are underestimated when it comes

to knowledge management.

62 Static or dynamic.

63 One-to-one dialogue between the knowledge recipient and the provider via direct communication.

64 Many-to-many relationships between knowledge providers and recipients that can be helpful to noise-, fidelity-, and credibility-related factors.

65 Open-venue exchange of knowledge among multiple recipients and sources.


4. Users ignore that, besides people, artifacts and organizational entities (such as

Stack Overflow) can also transfer knowledge (Alawneh 2016).

5. Knowledge transfer represents more than a back-up plan kicking in when critical

members of the team leave (Cancialosi 2017); it prevents members of the team

from leaving. In other words, an expert's letter of resignation should never kick off knowledge transfer.

6. There is a lack of video recordings (DeMeyer 1991; Zhang 2007) of code reviews

and lunch-and-learn sessions.

7. Learning materials tend to be long and convoluted (Jacobs 2018).

8. Prevalence of medium dysphoria, which occurs when users become confused while learning from similar materials stored in different formats.

9. Knowledge repositories become a disorganized dumping ground and contribute to

organizational waste (Jacobs 2018).

10. Peer-review is commonplace in academia but barely applied to IT departments’

knowledge bases.

11. Competence/reputational trust66 is often regarded as more important than

benevolence/emotional trust (Khandelwal and Gottschalk 2003; Ko 2010; Roberts

2000). Also, managers fail to take advantage of consultants’ expertise and Q&A

forums. Reputation tends to serve as an information filter due to the massive

amount of available information (Sarker et al. 2005).

66 Formal process.


12. Reverse and spontaneous mentoring67 have an impact on organizational

knowledge, work effectiveness, and job success/mobility (Geiger 1994; Quick

MBA 2010; Scandura 1992). Starcevich (1999) recommends a “power-free, two-

way, mutually beneficial relationship” between mentors and protégés.

13. Few companies conduct knowledge audits or appoint a CKO (Chief Knowledge

Officer), Vice President of Knowledge Management, CLO (Chief Learning

Officer), CICO (Chief Intellectual Capital Officer), CINO (Chief Innovation

Officer), or CPO (Chief People Officer) (Babcock 2004).

14. Unwillingness to change, archaic corporate politics, and irrelevance affect
knowledge-based programs (Babcock 2004). Hence, knowledge management

should be a company-wide problem, not an IT problem.

15. Only considering materials from the same field and industry, usually exclusively

available in English.

The behavior- and culture-driven solutions found in the literature are extensive but do not cover the problem of excessive data availability. DeSimone et al. (1995), Khandelwal and Gottschalk (2003), and Zhang (2007) describe the importance of incentives and reward systems, attitudes toward risk and reward, and hiring and training initiatives, on top of informal, as-needed, spontaneous, and circumstantial (Benabou and Benabo 1999) processes for creating a thriving organizational culture that encourages knowledge transfer. Nonetheless, Bock

67 Protégé-to-mentor.


and Kim (2002) claim that incentives, as opposed to motivations, are counterproductive.

Stack Overflow agrees with this view.

IT developers should never have to seek advice from the nicest or most experienced teammate but from the most knowledgeable one. As it turns out, "Bigger is

[not] more important than better” (Kalpic 2008, 4). It is not information volume (Sarker

et al. 2005) that matters but diversity, veracity, sparsity, and velocity (Ihrig and

MacMillan 2015; Koman and Kundrikova 2016).

The Stack Overflow platform comes with several embedded problems. For example, Ragkhitwetsagul (2017) surveyed registered Stack Overflow users and visitors68 to evaluate outdated code, its legal implications, and its detrimental effects when it introduces vulnerabilities. Unfortunately, most programming languages have a short lifecycle, causing many good, reliable articles to become deprecated within a matter of months.

Also, virtually no answerers include software licenses in their code snippets. Sixty-nine percent never validate against licensing69 conflicts. Nine percent of developers copy code on a daily basis. Sixty-four percent actively reuse code from the site and find problems with it but never report it, and nine percent experience legal issues.

In the following chapter, this research will explore novel techniques for collecting data and finding optimal answers within Stack Overflow, at the same time analyzing

68 Unregistered users.

69 In accordance with the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) document.

significant factors that contribute to more efficient knowledge transfer among technical users.


Chapter 3: Methodology

3.1 Introduction

This chapter categorizes Stack Overflow as a data repository and displays the

praxis’ outline in a flowchart that includes the study’s methodology. The chapter goes on

to describe how the data70 was collected and sanitized—a vital step before conducting

text analytics. The last section introduces the uses of text analytics and multiple

regression analysis, specifically partial least squares (PLS) and binary logistic regression

(BLR), as methodological approaches in the praxis. The predictive model will help to

find optimal answers within Stack Overflow and, in so doing, identify significant factors

contributing to more efficient knowledge transfer among software developers.

Delen and Al-Hawamdeh (2009) described a method for posting an unstructured

question to a generic system in Figure 3-1.71 The steps described are adopted by Stack

Overflow almost in their entirety. Table 3-1 relates Delen’s holistic framework to Stack

Overflow’s behavior. The table serves as a recapitulation of the definitions categorizing

Stack Overflow as a knowledge repository.

70 All referenced data was available in the Stack Exchange database and could be accessed via SQL queries.

71 Source: Dursun Delen, Suliman Al-Hawamdeh. Workflow for posing a query or an unstructured question to the system in A Holistic Framework for Knowledge Discovery and Management (Communications of the ACM 52 (6), 2009), 144. Used with permission.


Figure 3-1. Delen’s holistic framework for knowledge management discovery.


Table 3-1. Mapping Delen’s Holistic Framework into Stack Overflow

Delen's Holistic Framework | Stack Overflow Equivalent
Start interaction | User accesses Stack Overflow's website via a web browser or mobile app. Authentication is optional.
Submit query | Users try to find an answer to a question.
Search knowledge repository | The Stack Overflow search engine looks for a valid answer.
Answer found / satisfaction level | If an answer is found and it is satisfactory, the user ends the interaction (best-case scenario). If the answer is found, but it is not satisfactory, the user creates, refines, and submits a new query or question.
Answer not found / satisfaction level | If there is no answer for a posted question, other members of the community step in to offer a resolution. If at least one proposed solution is valid, the user interaction ends; otherwise, the process restarts.
Create knowledge nugget for KD | Each question, answer, comment, or suggested edit becomes a post.
Knowledge Depository (KD) | All posts are stored in the Stack Exchange Data Explorer.

Note. A few steps from Delen’s framework are ignored by Stack Overflow, such as identify human experts, consult with human experts, identify other human experts, and direct user to specialized discussion boards. Indeed, one goal of Stack Overflow is to keep discussions focused and concise and to eliminate unnecessary steps that could delay askers from getting their questions answered. Source: see footnote 71.


Supportability analysis72 (see Figure 3-2), a technique commonly applied to

logistics management to bolster efficiency, was applied in the planning of this paper. The

process is a sequence of eight steps. After evaluating the research requirements (step 1)

and determining the problem and thesis statements (step 2) in accordance with current

research (step 3), SEDE data was retrieved and sanitized (step 4). Upon selecting two

statistical methods (step 5), a new conceptual framework was built (step 6) and tested

(step 7). The analysis finishes with conclusions and recommendations (step 8), urging

other researchers to implement the proposed framework following standard system development procedures.

72 Source: Military Standard 1388-1.


Figure 3-2. Supportability analysis.

3.2 Data Collection and Analysis

The data source for this research is the Stack Exchange Data Explorer (SEDE) repository, which contains a comprehensive data dump of Stack Overflow’s posts. The


SEDE73 was updated in September 2019 and is publicly accessible. It stores almost 400 files and 60 gigabytes of data. The repository contains questions, answers, comments, tags, and user information from Stack Overflow.

The Stack Exchange API is an online, freely available platform with read-only rights that allow researchers to query the Stack Overflow data repository and retrieve near-real-time information about the site's activities. Results are limited to 50,000 records per query, probably to prevent denial of service (DoS) attacks and to guarantee the SEDE's responsiveness.

With thousands of posts entered into Stack Overflow daily, data collection and cleaning was a challenge in this study. Several steps were needed to ensure the sample data was representative of the population. First, posts were categorized via a stratified74 and systematic75 [without replacement] random76 approach. Second, case studies were

used to validate the reliability and accuracy of the results. Such a testing methodology

73 Sources: https://Meta.StackExchange.com/questions/2677/database-schema-documentation-for-the- public-data-dump-and-sede and https://Data.StackExchange.com/stackoverflow/query/835150/list-all- fields-in-all-tables-on-sede.

74 This is a statistical method that breaks the population into subpopulations and then into samples from each subpopulation. It introduces certain risks. The criteria for determining subpopulations might be biased, and convenience sampling might result. Convenience sampling is a non-randomized method that simply looks at the data that is most easily available; e.g., the first ten objects of a set.

75 This is a statistical method that selects samples from a population following ordered framing. For example, in a nine-person population where each person is assigned a unique, sequential identifier, a systematic sampling would be selecting elements three, six, and nine.

76 Random sampling ensures that all elements in the sample are selected by chance, considering that each element had the same probability of being chosen.


established benchmarks to corroborate the accuracy of the proposed model’s predictive

power.

The critical areas for data collection and the interrelations within the data are

illustrated in Figure 3-3. The proposed model considers question and answer posts’

metadata, user information, comments, feedback, badges, tags, and votes (see Appendix

A).

Figure 3-3. Relationship between critical data fields collected.


The data was transformed using a variation of the Calefato approach (Calefato

2018). The Calefato approach converts variables with a high variance into numerical categories. For example, the literature observes that users’ reputations in Stack Overflow are highly variable—scores range between 1 and over 1,000,000—and prevent even distribution of information across different topics. Through the use of the Calefato method, new users received a value of 1,77 low reputation users a value of 2,78 established

users a 3,79 and expert/trusted users a 4.80
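As an illustration, this categorization can be expressed in a few lines of Python; the following is a minimal sketch assuming the reputation thresholds defined in footnotes 77 through 80, with a hypothetical function name:

def categorize_reputation(reputation):
    # Map a Stack Overflow reputation score to a Calefato-style category.
    if reputation < 10:
        return 1  # new user
    elif reputation < 1000:
        return 2  # low reputation user
    elif reputation <= 20000:
        return 3  # established user
    return 4      # expert/trusted user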

As part of the data cleaning process, generic questions such as “I keep having

problems” or discussion triggers like “Tell me what your favorite programming language

is” were removed from the dataset. Stack Overflow favors answers that are concrete and

verifiable, disfavoring answers that are vague, open-ended, and conducive to a

fragmented discussion. This does not imply, however, that there is but a single way of

solving concrete problems.

The current praxis uses quantitative methods (i.e., regression analysis) for data

analysis. Additionally, the proposed framework evaluates post quality and determines the

conditions that lead to successful answers— having the highest community score in the

77 Current score < 10. Reputational scores cannot have a negative value as the minimum score will always be one, no matter how many times a user is downvoted. Historical reputation graphs, however, can contain metadata to determine a user’s true score, which can be negative. For simplicity’s sake, this research assumes all reputational scores in Stack Overflow to be positive integers greater than or equal to one.

78 Current score in the range [10, 1,000).

79 Current score in the range [1,000, 20,000].

80 Current score > 20,000.

thread and being selected by the user who posted the question—in Stack Overflow. The results are presented in chapter 4.

3.3 Research Methods

The following sub-tasks are warranted by the need to ensure the validity of the predictive framework that is developed as part of chapter 4:

1. Collect a sample from the SEDE website and ensure that the sample is both

representative and systematic.

2. Compare and contrast answers with a high positive score (upvoted) versus

answers with a negative score (downvoted). This process facilitates the

identification of the factors driving scoring mechanisms within Stack Overflow

and their relationship to high-quality posts.

3. Determine if, as hypothesized, an answerer’s reputation is the driving factor

leading to answer selection. For example, if a question is answered by a member

having a higher-than-average reputation, then the answer has better odds of being

selected.

As previously stated, the following research hypotheses will be evaluated:

RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (has fewer than 2,000 characters), then the answer will have a high relevance score.

RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the


answer will more likely be selected as the asker’s preferred answer, thus becoming easier to find.

RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow’s high-scoring topics will be easier to find (search optimization).

Due to the neutral style of Stack Overflow’s posts, sentiment analysis and affinity tables were not the best methods for evaluating the content of questions, answers, and comments. Another factor that contributed to this choice was the existence of strong vetting, social grooming, and heavy editing against emotionally-leaning posts within the platform.

The research product, depicted below in Figure 3-4, justifies the use of the research methods—NLP, PLS, and BLR.

NLP branches out from artificial intelligence with the goal of deciphering human language in a way that is understandable to computer systems (Garbade 2018). NLP is hard to implement due to the nuances of human communication, permeated by slang and sarcasm. Two families of techniques used in NLP are syntactic and semantic analysis.81 Most NLP conducted in this praxis was applied via the ProWritingAid tool.

PLS will be applied to determine whether an answer has a high score or not. PLS

is based on multivariate regression and principal component analysis (Roy 2015). Even

81 Some syntax methods are: lemmatization, morphological segmentation, word segmentation, part-of- speech tagging, parsing, sentence breaking, and stemming (Garbade 2018). Some semantics-based processes are: named entity recognition (NER), word sense disambiguation, and natural language generation (Garbade 2018).


though this statistical method is often used when more than one dependent, non-binary variable exists, the nature of answer scores within Stack Overflow made it a good fit for the analysis. In fact, PLS is less likely to cause overfitting than linear regression (Roy

2015) as it simultaneously addresses variability and correlation, which appear to be endemic problems in Stack Overflow. Furthermore, PLS is able to handle data noise competently.
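For readers who prefer code to prose, a rough equivalent of this analysis can be run outside Minitab; the sketch below uses scikit-learn's PLSRegression with random placeholder data standing in for the transformed predictors and binary answer-score categories used in this praxis:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.random((599, 6))        # placeholder for the six binary-coded predictors
y = rng.integers(0, 2, 599)     # placeholder: 1 = high-scoring answer, 0 = normal

pls = PLSRegression(n_components=3)
pls.fit(X, y)

fitted = pls.predict(X).ravel()            # continuous fitted responses
predictions = (fitted >= 0.5).astype(int)  # a cutoff turns fits into 0/1 labels
accuracy = (predictions == y).mean()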

Similarly, BLR is regarded as a statistical classification technique that analyzes a set of independent variables to predict a categorical—usually binary—dependent variable. This justifies its use for predicting when an answer is more likely to be tagged as the preferred solution.

Figure 3-4. Research product.


Upon completing the data collection and preliminary cleaning process,82 the methodology to find the best (optimal) answer was divided into four major areas: data preprocessing, text analytics, post metadata, and author metadata.

First, data preprocessing (see Figure 3-5) involved balancing and randomizing the collected post sample for the year 2018 in a way that the data used was evenly distributed across the entire year. For example, assuming there were a total of 1,000 posts, there could not be 500 posts from January and 500 posts from December; a better distribution would be approximately 80 posts for each month of the year. This process is known as systematic sampling. Additionally, fields having a high variability and wide ranges, such as reputation and answer and question scores, were converted into numerical categories.

The categories used were high=1 and normal=0.

82 Process of elimination of blank, incomplete, or expired records.


Figure 3-5. Data preprocessing.
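A minimal sketch of this balancing step in Python (pandas), assuming the posts have already been loaded into a data frame with a datetime Answer_Creation_Date column; the function name is illustrative:

import pandas as pd

def balance_by_month(posts, per_month):
    # Keep only 2018 posts and draw up to per_month posts from each month.
    posts = posts[posts["Answer_Creation_Date"].dt.year == 2018]
    groups = posts.groupby(posts["Answer_Creation_Date"].dt.month)
    return groups.apply(lambda g: g.sample(min(len(g), per_month))).reset_index(drop=True)

# A 1,000-post sample would use roughly 80 posts per month:
# balanced = balance_by_month(posts, per_month=80)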

Second, natural language processing (NLP) and text analytics (see Figure 3-6) were applied over text-based columns, such as questions, answers, and comments. The textual content was also analyzed before modification. The number of words, word frequency (word cloud), and the number of characters were calculated for each post's title and body. Moreover, a capitalization ratio was compiled by counting the total number of uppercase characters and dividing it by the total number of characters in each post. For instance, the word Hello will have a capitalization ratio of 0.2.83 There are limitations to this metric, especially with SQL-related posts containing sample code. SQL commands

83 There are a total of five characters. Only one uppercase character exists: H. There are four lowercase characters: e, l, l, and o. The rate is calculated as follows: 1 / 5 = 0.2.


(statements and functions) are usually written in uppercase, which could offset the

capitalization ratio.

Figure 3-6. Text analytics.
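A minimal Python sketch of the capitalization-ratio metric, consistent with the Hello example above:

def capitalization_ratio(text):
    # Ratio of uppercase characters to total characters in a post.
    if not text:
        return 0.0
    upper = sum(1 for ch in text if ch.isupper())
    return upper / len(text)

# capitalization_ratio("Hello") returns 1 / 5 = 0.2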

The ProWritingAid v2.0 software tool—grammar checker, style editor, and writing mentor—was manually used to calculate grammar and spelling scores.

Readability scores84 were calculated by looking at the numbers of syllables per word and

words per sentence, thus determining the educational grade-level of the writing; usually,

shorter words and sentences yield better results. Automating this process would have

involved a third-party integration with the text and grammar checking API,85 but that step

84 Readability measures should have a score greater than 60. Glue and transition indexes should be under 40% and over 25% respectively. Some of the metrics used were: Flesch-Kincaid Grade, Coleman-Liau, Automated Readability Index, Dale-Chall Grade, Flesch Reading Ease, and Dale-Chall Ease.

85 Source: https://ProWritingAid.com/en/App/API.

would fall outside the scope of the current praxis. (See Figures 3-7 and 3-8 for additional details regarding the ProWritingAid software.)

Figure 3-7. ProWritingAid toolbar in Microsoft Office 365 ProPlus.

Figure 3-8. ProWritingAid summary.
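As an illustration of how such readability scores are derived, the sketch below computes the Flesch Reading Ease metric from word, sentence, and syllable counts. The syllable counter is a crude vowel-group heuristic, and neither function reflects ProWritingAid's internal implementation:

def count_syllables(word):
    # Approximate syllables by counting groups of consecutive vowels.
    vowels = "aeiouy"
    count, previous = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not previous:
            count += 1
        previous = is_vowel
    return max(count, 1)

def flesch_reading_ease(words, sentences):
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))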


The next step was the removal of all articles (e.g., a, the) and “glue” words (e.g.,

some, much, just) from the text. Once the text was reduced, stemming was applied.

Stemming is a technique that reduces a word to its root form, also known as the stem. Stemming differs from lemmatization in that the stem obtained might not be an actual word in human language, whereas a lemma always is. (The programmatic process that

accumulates the bags of words’ scores, using buzzwords and code tags, is described in

chapter 4.)
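A minimal sketch of this reduction step using Python's NLTK, the toolkit also used for the case studies in chapter 4; the standard stop-word list stands in for the articles and "glue" words described above, and the corpora must be downloaded beforehand:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("stopwords") and nltk.download("punkt").
def reduce_text(text):
    # Remove stop words, then stem each remaining token to its root form.
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

# reduce_text("Just comparing the strings") returns ['compar', 'string']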

Third, post metadata (see Figure 3-9) was analyzed using available statistical techniques, specifically, regression via partial least squares and binary logistic regression.

Post metadata refers to timestamps and metrics from posts. Posts can appear as questions, answers, comments, or suggested edits.

Figure 3-9. Post metadata.


Fourth, author metadata (see Figure 3-10) was examined, as were activity and

reputational scores, geographical location, and the number of profile views at the time of the analysis.

Figure 3-10. Author metadata.

The framework also considered whether one or more solutions have high scores, whether a code snippet is present, and the number of hyperlinks86 within

it. These three metrics, along with the analysis of the four major areas presented above,

will classify the solution as optimal or not.

86 Also known as URLs (uniform resource locators), or locations/addresses of internet resources.


Chapter 4 further elaborates on the nuances of NLP, PLS, and BLR, and introduces examples of how the different metrics were calculated, testing the proposed conceptual framework using three case studies.


Chapter 4: Results

4.1 Introduction

As an important step prior to collecting the data, Stack Overflow's Annual Developer Survey results for 2018 were reviewed. The survey was conducted by Stack Overflow in January 2018 and included over 100,000 participants from 183 countries.87 The main takeaways from the survey were the wide adoption of DevOps, machine learning, and artificial intelligence techniques (and their ethical implications), in addition to contrasting individual perspectives and gender-determined88 goals within the Stack

Overflow community. Most users reside in the United States, Western Europe, and India, and the overwhelming majority of developers do coding as a hobby. Furthermore, more than 80% of developers rely on Stack Overflow when learning new programming strategies and techniques, a high level that should be considered with caution, given the survey’s participants. For that reason, survey data was not used as part of the statistical analysis.

87 Source: https://Insights.StackOverflow.com/survey/2018#overview.

88 Less than 10% of users at Stack Overflow are female, but these stats are slowly shifting.


Figure 4-1. Stack Overflow users’ age, educational level, and wake-up time.

As depicted in Figure 4-1, most surveyed users were between 25 and 45 years old, had a bachelor’s degree, and started working between 7:00 am and 9:00 am.89 This information was useful in determining the frequency of contributions and post quality on the forum and in noticing how most developers were below 46 years of age.90

Furthermore, Figure 4-2 shows a list of widely-used methodologies, platforms, databases, frameworks, and programming languages. This data was vital for selecting and

89 These times were adjusted to reflect the local time of the users answering the survey.

90 In most professions, tenure is more prevalent among employees who are between 55 and 64 years of age.


developing three appropriate case studies that could validate the proposed research

product’s predictive power.

Figure 4-2. Most popular technologies in Stack Overflow.

4.2 Data Collection and Preprocessing

After the main trends from the year 2018 were determined, it was easier to proceed with the process of data collection, cleaning, and transformation.

The Stack Exchange Data Explorer contains a query editor91 that retrieves post

metadata from Stack Overflow. The query used appears below and shows that the main

database tables used were Posts, Users, Comments, Suggested Edits, Post Tags, and Tags.

91 Source: https://Data.StackExchange.com/stackoverflow/query/new.


As previously mentioned, the data was limited to the year 2018, and the answers’ scores had to be less than -2 or greater than 5. All blank/NULL or incomplete records were removed from the data prior to analysis. The query, output, and execution plan from

SEDE are part of Figures 4-3, 4-4, and 4-5 respectively.

SELECT DISTINCT
    CASE WHEN p1.[AcceptedAnswerId] > 0 THEN '1' ELSE '0' END AS 'Answer_Accepted'
    , p.[Score] AS 'Answer_Score'
    , p.[CreationDate] AS 'Answer_Creation_Date'
    , p1.[Score] AS 'Question_Score'
    , p1.[CommentCount] AS 'Question_Number_Of_Comments'
    , p1.[ViewCount] AS 'Question_Number_Of_Views'
    , p1.[Tags] AS 'Question_Tags'
    , p.[Body] AS 'Answer_Text'
    , p.[CommentCount] AS 'Answer_Number_Of_Comments'
    , u.[Reputation] AS 'Author_Reputation'
    , u.[Views] AS 'Author_Views'
    , SUM(t.[Count]) AS 'Tag_Sum'
    , ISNULL(SUM(c.[Score]), 0) AS 'Comment_Sum'
    , ISNULL(COUNT(se.[Id]), 0) AS 'Suggested_Edits_Count'
FROM [Posts] p
INNER JOIN [Posts] p1 ON p.[ParentId] = p1.[Id]
INNER JOIN [Users] u ON p.[OwnerUserId] = u.[Id]
LEFT JOIN [Comments] c ON p.[Id] = c.[PostId]
LEFT JOIN [SuggestedEdits] se ON p.[Id] = se.[PostId]
INNER JOIN [PostTags] pt ON pt.[PostId] = p1.[Id]
INNER JOIN [Tags] t ON t.[Id] = pt.[TagId]
WHERE p.[PostTypeId] = 2
    AND YEAR(p.[CreationDate]) = 2018
    AND YEAR(p1.[CreationDate]) = 2018
    AND (p.[Score] > 5 OR p.[Score] < -2)
GROUP BY p.[Id], p.[CreationDate], p1.[Score], p1.[CommentCount]
    , p1.[ViewCount], p1.[Tags], p1.[AcceptedAnswerId], p.[Score]
    , p.[Body], p.[CommentCount], u.[Reputation], u.[Views]
HAVING COUNT(se.[Id]) > 0 AND SUM(c.[Score]) > 0
ORDER BY p.[Score] DESC;

Figure 4-3. Query used for data collection in SEDE.


Figure 4-4. SEDE’s partial results.

Figure 4-5. Query’s partial execution plan.

Table 4-1 was generated in Microsoft Excel for Office 365 v16.0 32-bit. Fields with high variability (i.e., answer’s score and question’s number of views) were

converted into binary variables using a variation of the Calefato approach (Calefato

2018), described in the third chapter (methodology). All operations were completed on a

Dell XPS 8300 desktop computer with Windows 7 Professional, Service Pack 1, 16 gigabytes of RAM, and an Intel(R) Core(TM) i7-2600 3.40 gigahertz processor.

Table 4-1. Sample data for a single post

Columns / Metadata | Original Data | Transformed Data | Microsoft Excel Formula
Answer_Accepted | 1 | 1 | -
Answer_Score | 3,264 | 1 | =IF(CELL > AVERAGE(Answer_Score) * 2, 1, 0)
Answer_Creation_Date | 1/15/2018 8:35:15 pm (MM/DD/YYYY) | 3 | =WEEKNUM(CELL)
Question_Score | 2,421 | 1 | =IF(CELL > AVERAGE(Question_Score) * 2, 1, 0)
Question_Number_Of_Comments | 17 | 1 | =IF(CELL > AVERAGE(Question_Number_Of_Comments) * 2, 1, 0)
Question_Number_Of_Views | 360,255 | 1 | =IF(CELL > AVERAGE(Question_Number_Of_Views) * 2, 1, 0)
Question_Tags (renamed Bags_Of_Words_Score) | ...-6> | 24 | Score calculated in the C-sharp utility program (see Figure 4-6)
Answer_Text (renamed Answer_Text_Length) | "If you take advantage of..." | 1 | =IF(LEN(CELL) < 2000, 1, 0)
Answer_Number_Of_Comments | 19 | 1 | =IF(CELL > AVERAGE(Answer_Number_Of_Comments) * 2, 1, 0)
Author_Reputation | 87,398 | 1 | =IF(CELL > AVERAGE(Author_Reputation) * 2, 1, 0)
Author_Views | 21,365 | 1 | =IF(CELL > AVERAGE(Author_Views) * 2, 1, 0)
Tag_Sum | 107,472,957 | 1 | =IF(CELL > AVERAGE(Tag_Sum) * 2, 1, 0)
Comment_Sum | 870 | 1 | =IF(CELL > AVERAGE(Comment_Sum) * 2, 1, 0)
Suggested_Edits_Count | 114 | 1 | =IF(CELL > AVERAGE(Suggested_Edits_Count) * 2, 1, 0)

Note. The CELL keyword varies depending on the value to be transformed. Averages are calculated for the entire referenced column.
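The doubled-average rule shown in the Excel formulas above can be expressed compactly in Python (pandas); this is a hedged sketch, assuming the SEDE export has been loaded into a data frame named posts:

import pandas as pd

def binarize(column):
    # 1 when a value exceeds twice the column average, 0 otherwise.
    return (column > 2 * column.mean()).astype(int)

# Examples mirroring Table 4-1:
# posts["Author_Reputation"] = binarize(posts["Author_Reputation"])
# posts["Answer_Text_Length"] = (posts["Answer_Text"].str.len() < 2000).astype(int)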

Due to the high number of line breaks, commas, single-quotes, double-quotes, and

special characters within the answer’s body content column, an exhaustive data cleaning

process was conducted. One of the Excel formulas used is shown here:

=CLEAN(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(H2, CHAR(13), ""),

CHAR(10), ";"), ",", ";"))). Not only did these operations make it easier to analyze the

data, but they also linearized all textual content, making it compatible with the comma-separated values (CSV) file format. The encoding applied was UTF-8.92 Moreover, code

syntax, such as C++, which interfered with the compilation process, was substituted with

an acceptable value (e.g., C-plus-plus).

92 This character encoding method is a variation of the Unicode Transformation Format.
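For reference, an approximate Python equivalent of that Excel cleaning formula is sketched below; the function name is illustrative:

def linearize(text):
    # Mirror SUBSTITUTE: drop carriage returns, map line feeds and commas to semicolons.
    text = text.replace("\r", "").replace("\n", ";").replace(",", ";")
    # Mirror CLEAN and TRIM: remove non-printable characters, collapse whitespace.
    text = "".join(ch for ch in text if ch.isprintable())
    return " ".join(text.split())

# linearize("line one,\r\nline two") returns "line one;;line two"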


The calculation of the BOW scores was implemented in Visual Studio
Professional 2019 v16.3 with the .NET Framework v4.8. The basic functionality,

summarized in Figure 4-6, required the creation of two CSV files, one containing the

original raw data and a second file for storing the values compiled in the utility C-sharp program.

A total of 1,485 lines were analyzed, excluding the header row, which was omitted from the analysis. Each line was expected to have 14 elements—one for each column displayed in Table 4-1. Only the elements below, however, were used for compiling BOW scores:

 Question_Tags (position 6,93 element 7)

 Answer_Text (position 7, element 8)

 Tag_Sum (position 11, element 12)

Using regular expressions and ignoring case sensitivity, for every matching tag

within the body of an answer, five points were added. If a code tag was present, ten

points were added. (Other data validation techniques and error handling implemented in

the program are not included as part of Figure 4-6.)

93 In C-sharp, element positions start at zero. For instance, the last position of an array with 10 elements would be 9.


// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
int lineNumber = 0;
string[] allLines = File.ReadAllLines(_rawDataFile, Encoding.UTF8);
foreach (string line in allLines)
{
    lineNumber++;
    // Skip the header row.
    if (lineNumber == 1) { continue; }

    string[] fields = line.Split(',');
    // Every record must contain the 14 columns listed in Table 4-1.
    if (fields.Length != 14)
    {
        File.AppendAllText(_textScoresFile, "0" + "\n");
        continue;
    }

    string questionTagList = fields[6];
    string answerBody = fields[7];
    string tagSum = fields[11];
    string[] tagList = questionTagList.Split('>');
    int count = 0;
    foreach (string tag in tagList)
    {
        string cleanTag = tag;
        cleanTag = cleanTag.Replace(">", string.Empty);
        cleanTag = cleanTag.Replace("<", string.Empty);
        if (!string.IsNullOrEmpty(cleanTag))
        {
            // Five points for every case-insensitive tag match within the answer's body.
            foreach (Match match in Regex.Matches(answerBody, cleanTag, RegexOptions.IgnoreCase))
            {
                count = count + 5;
            }
        }
    }

    // Ten points when the answer contains a <code> tag.
    if (answerBody.ToLower().Contains("<code>"))
    {
        count = count + 10;
    }
    File.AppendAllText(_textScoresFile, count.ToString() + "\n");
}

Figure 4-6. C-sharp functionality for calculating BOW scores.

The link between the data retrieved from the SEDE website and the quantitative methods applied is shown below in Figure 4-7 and further developed in the following section. Only SEDE data (and not survey information) was used to perform the statistical analysis.


Figure 4-7. Linking data source and quantitative methods.

4.3 Predictive Models

In the current section, the following two research hypotheses will be evaluated:

RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (has fewer than

2,000 characters), then the answer will have a high relevance score.

RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the


answer will more likely be selected as the asker’s preferred answer, thus becoming easier

to find.

The dataset was thoroughly evaluated to determine if it was evenly distributed

throughout the year 2018, a process known as systematic sampling. The total sample size

consisted of 1,485 records (n = 1,485; see Figure 4-8).

The box and whisker and histogram plots shown below indicate the data was adequate for further analysis. Post frequency is consistent from month to month, except for a steady decline toward the end of the year. This can be explained by two main factors: Many employees use accrued vacation time that was set to expire toward the end of the year, and most of the year's holidays occur in the months of November and December

(i.e., Thanksgiving Eve, Thanksgiving Day, Christmas Eve, Christmas Day, and New

Year’s Eve).


Figure 4-8. Post frequency distribution per week.

The research questions posed in chapter one also served as the basis for the statistical analysis. As expected, the average score of accepted answers was 55.53%

higher than the average score of rejected ones.94 The standard deviation calculated for overall author reputation was 87,645,95 which shows how much variability is prevalent within Stack Overflow in terms of technical expertise and user engagement.

Similarly, almost three-quarters of answers96 contained at least one tag or buzzword from the question. The average number of comments per answer was 70.07% higher than the average number of comments per question.97 The range of suggested edits per chosen answer was 1 to 340,98 which hints at varying levels of post quality.

Quantitative analysis (multivariate regression) and, specifically, partial least squares with a confidence level of 95% (α = 0.05) was conducted in Minitab v18.1.

The total number of principal components—not to be confused with significant variables or factors—specified was five.

94 For accepted answers, the formula used was =AVERAGEIF(Answer_Accepted, 1, Answer_Score), which returned an average score of 39.38. For rejected answers, the formula used was =AVERAGEIF(Answer_Accepted, 0, Answer_Score), which returned an average score of 25.32.

95 Excel formula: =STDEVA(Author_Reputation).

96 Excel formula: = 1,033 / COUNTA(Question_Tags). The numerator value of 1,033 was compiled in C- sharp by using an accumulator method that added one every time an answer’s body had at least one question’s tag: each post could only be added once. The resulting pairing or matching was 69.56%.

97 For answers, the formula used was =AVERAGE(Answer_Number_Of_Comments), which returned an average of 4.83. For questions, the formula used was =AVERAGE(Question_Number_Of_Comments), which returned an average score of 2.84.

98 The range's floor or minimum value was calculated in Excel as follows: {=MIN(IF(Answer_Accepted = 1, Suggested_Edits_Count))}; whereas the ceiling or maximum value was calculated in the following fashion: {=MAX(IF(Answer_Accepted = 1, Suggested_Edits_Count))}.


A few researchers have used PLS for dealing with binary, dependent response variables. Some of the scenarios considered were:

 Rodríguez-Pérez (2018) demonstrated that PLS does not perform well while

analyzing factors having a high dimensionality or when applied to datasets

containing few records.

 Yin (2018) and Cao (2018) used discrete, modified binary particle swarm

optimization (BPSO) algorithms via PLS and adaptive models, such as remote

sensing.99

 Within a hurdle model,100 Zhang (2018) explored how a binomial framework

guided the dependent variables’ binary outcomes, which can be zero or positive.

 Sun (2019) predicted simulated phenotypical traits of different populations using

binary indicators in the response variable, which outperformed FST101 and

EigenGWAS102 methods.

 Conversely, Murase (2018) applied PLS to material discrimination in the near-infrared (NIR) band, with outcomes stored as binary classifiers.

The first research hypothesis tries to predict when an answer might have a high relevance score. Nonetheless, the data collected was unbalanced because only 20.54% of answers (305 out of 1,485) had a high score. Random sampling was applied to the dataset

99 Remote sensing is a reliable tool for measuring [inland] water quality (Cao 2018).

100 A hurdle model is based on binary or before-and-after thresholds.

101 A fixation index is a score of population differentiation that evaluates genetic structures (Sun 2019).

102 Combination of genome-wide association studies and eigenvector decomposition (Sun 2019).


to ensure that the percentage of high-scoring answers approached 50%.103 After

randomization, the new sample had 305 high-scoring and 294 regular answers (n = 599,

50.92%).

The best predictive accuracy obtained using all independent variables approached

70%. To prevent overfitting, it was then decided to choose significant factors by looking

into standard coefficient and loading plots (see Figures 4-9 and 4-10). Upon careful

examination, the following six variables104 were selected, and the model was re-run with three components, yielding a prediction accuracy of 73.96%:

 Question_Score

 Question_Number_Of_Views

 Author_Views

 Comment_Sum

 Suggested_Edits_Count

 Answer_Text_Length

103 The Excel formula used for randomizing and balancing the data was: =IF(CELL = 1, RANDBETWEEN(1, 80), RANDBETWEEN(81, 100)).

104 The first three independent variables were pulled from the coefficient and loading plots, whereas the remaining three were included from the first research hypothesis.



Figure 4-9. PLS standard coefficient plot for answer score.


Figure 4-10. PLS loading plot for answer score.


The response plot is presented below in Figure 4-11. The resulting statistics are also included. Large residuals in the Y-axis are common in poor models, whereas X-residuals can identify outliers (Roy 2015).


Figure 4-11. PLS response plot for answer score.

Table 4-2. Analysis of Variance

Source DF SS MS F P

Regression 3 35.886 11.9620 65.24 0.000

Residual Error 595 113.814 0.1913

Total 598 149.699


Table 4-3. Model Selection and Validation

Components X Variance Error R-squared

1 0.366245 114.172 0.237327

2 0.544771 113.818 0.239690

3 0.662258 113.814 0.239720

Table 4-4. Coefficients

Term Answer_Score Standardized Answer_Score

Constant 0.323585 0.000000

Question_Score 0.255689 0.205350

Question_Number_Of_Views 0.226486 0.181333

Author_Views -0.078108 -0.042821

Comment_Sum 0.207882 0.157511

Suggested_Edits_Count 0.134519 0.101144

Answer_Text_Length 0.043785 0.033549

Minitab’s fit coefficients were used to determine the predicted value via a variable

cutoff threshold. Predicted values were then compared to actual values, and an

accuracy score was compiled. For example, out of 599 records, 443 of them were correctly predicted, for a 73.96% accuracy score. For easier testing, minimum and maximum fit coefficients were retrieved, and a histogram was built using fit coefficients.


This contributed to a better understanding of the dynamics and behaviors guiding the prediction model.
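The cutoff search just described can be reproduced with a short script; this sketch assumes fitted holds Minitab's fit coefficients and actual the observed binary categories, both as NumPy arrays:

import numpy as np

def best_cutoff(fitted, actual):
    # Sweep cutoffs between the minimum and maximum fits, keeping the most accurate.
    best_t, best_acc = None, 0.0
    for t in np.linspace(fitted.min(), fitted.max(), 101):
        acc = ((fitted >= t).astype(int) == actual).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# With the praxis data, 443 correct predictions out of 599 records yields 443 / 599 = 73.96%.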

The results seemed to confirm the first hypothesis, given that the answer's comment and character counts, along with suggested edits and related factors such as question score and author views, were found to be significant.

Unsurprisingly, questions’ scores and the number of views have a positive influence on good answers. Good, relevant questions have higher demand and receive more traffic than do poor, irrelevant questions; hence, there is a greater number of users who can upvote the solutions within the thread.

An interesting finding is that an author's number of profile views correlates negatively with answer scores. A plausible explanation could be that, when trying to offer feedback for an erroneous answer, knowledgeable users might visit the author's page to give their criticism in a direct, yet private fashion.

While the first research hypothesis predicts high-scoring answers (the community's best answer), the second hypothesis seeks to predict whether an answer is chosen by the individual user who posed the question. In this case, due to a lack of cross-validation controls and to the existence of a myriad of subjective factors guiding individual user behavior, one may anticipate these results to be less reliable than those obtained for the first hypothesis; this caveat is especially relevant if the prediction accuracy turns out to be greater.


Once again, the data for chosen or accepted answers was skewed, and 77.85% of

the answers were marked as accepted (1,156 posts out of 1,485). Random sampling was

conducted, and the resulting dataset contained only 332 chosen answers (n = 639,

51.96%).105

The statistical method used for analysis was BLR (i.e., the answer is selected or not), with Logit as the link function, a two-sided confidence level of 95%, and

Pearson residuals. All generated plots were standardized.

The five significant factors are displayed below:

 Answer_Score

 Comment_Sum

 Suggested_Edits_Count

 Bags_Of_Words_Score

 Tag_Sum

105 The Excel formula used was: =IF(CELL = 1, RANDBETWEEN(1, 20), RANDBETWEEN(21, 100)). CELL refers to individual values within the Answer_Accepted column.


Table 4-5. Deviance Table

Source DF Adj. Dev. Adj. Mean Chi-Square P-Value

Regression 5 280.901 56.180 280.90 0.000

Answer_Score 1 205.182 205.182 205.18 0.000

Comment_Sum 1 10.021 10.021 10.02 0.002

Suggested_Edits_Count 1 54.669 54.669 54.67 0.000

Bags_Of_Words_Score 1 2.345 2.345 2.35 0.126

Tag_Sum 1 2.537 2.537 2.54 0.111

Error 633 603.963 0.954

Total 638 884.864

Table 4-6. Model Summary

Deviance R-squared Deviance R-squared (adjusted) AIC

31.75% 31.18% 615.96

Table 4-7. Coefficients

Term Coefficient SE Coefficient VIF

Constant -1.14200 0.14100

Answer_Score 2.99300 0.24600 1.28

Comment_Sum -1.15700 0.36300 1.35

Suggested_Edits_Count 2.41900 0.35800 1.62

Bags_Of_Words_Score -0.00200 0.00127 1.04

Tag_Sum -0.52300 0.33100 1.58


Table 4-8. Goodness-of-fit Tests

Test DF Chi-Square P-Value

Deviance 633 603.96 0.791

Pearson 633 659.21 0.228

Hosmer-Lemeshow 8 14.97 0.060

The prediction accuracy of chosen answers was 80.91% (517 correct predictions

out of 639 total records). Even though it was possible to predict chosen answers

successfully, the bags of words and cumulative tag scores were not as significant as the

answer score and the suggested number of edits. More research might be needed to

explore the significance of bounties and badges in answer selection by users who pose

questions.

The regression equation describing the model used to predict answer selection is

the following:

P(1) = exp(Y') / (1 + exp(Y'))

Y' = -1.142 + 2.993 Answer_Score - 1.157 Comment_Sum + 2.419 Suggested_Edits_Count

- 0.00200 Bags_Of_Words_Score - 0.523 Tag_Sum
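Applied directly, the fitted equation translates into the following Python sketch, which returns the probability that an answer is selected; the parameter names mirror the transformed columns:

import math

def probability_selected(answer_score, comment_sum, suggested_edits_count,
                         bags_of_words_score, tag_sum):
    # Evaluate the fitted BLR model: P(1) = exp(Y') / (1 + exp(Y')).
    y = (-1.142 + 2.993 * answer_score - 1.157 * comment_sum
         + 2.419 * suggested_edits_count - 0.00200 * bags_of_words_score
         - 0.523 * tag_sum)
    return math.exp(y) / (1 + math.exp(y))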

With the confirmation of the first two hypotheses, the next step consists of

validating the research product. As described in the following section, the initial two hypotheses serve as the foundation for implementing the case studies.


4.4 Case Studies

In this section, the research product is tested by applying it to three case studies, spanning .NET-, API-, and JavaScript-related questions. The third hypothesis is the following:

RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow’s high-scoring topics will be easier to find (search optimization).

Figure 4-12 shows the link between the two hypotheses covered in the previous section and the third hypothesis developed in the current one.

Figure 4-12. Relationship model between research hypotheses and case studies.


The first case study consisted of finding a Stack Overflow question related to

.NET via the Stack Exchange Data Explorer or Google search engine. Hence, the following inquiry was found: Do we use the double equal sign (==) or .Equals() for comparing strings in C#?106 All four areas described in the methodology chapter were applied.

 Data preprocessing:

o Converted answer scores into categories (high: > 50, normal: >10, and low:

<= 10): Out of 16 answers, 11 were low, 3 were normal, and 2 were high.

o Converted question scores into categories (high: > 50, normal: >10, and low:

<= 10): Since a single question was analyzed and its score was 491, it was

tagged as high.

 Text analytics:

o Compiled the cumulative BOW score (unique tags plus code tags). There were

no code tags found within the optimal answer (highest score and asker’s

choice), but the BOW score was 10 because the equals tag appeared twice.

Standard tags (e.g., c#, .net, equals) add five points every time they appear,

whereas code tags add 10 points per occurrence.

o Calculated the number of characters per post’s body and title: Counting

spaces, the question’s title had 37 characters, and its body had 393 characters.

o Generated the number of words per post and calculated word frequency:

There were 6 words in the title and 55 words in the body.

106 Source: https://StackOverflow.com/questions/814878/c-sharp-difference-between-and-equals.


o Capitalization (uppercase) ratio: Out of 235 characters in the answer, only 6

were uppercase characters: 6 / 235 = 2.55%.

o Removed articles, stop and “glue” words, and applied stemming:107 After

using the NLTK (Natural Language Toolkit) in Python, the following words

remained: ==, expression, type, System.Object.ReferenceEquals.Equals,

virtual, method, override, version, string, comparison, and content (11 words

remaining out of 38 original words).

o Grammar/spelling score: Using ProWritingAid, the grammar score was

determined to be 22/100 while the spelling score was 35/100.

 Post metadata:

o Day of the week: Saturday, 05/02/2009.

o Time of the day (morning, afternoon, evening, etcetera): Afternoon, 13:39 or

1:39 pm.

o Number of answers/comments/edits and question/answer/comment score:

There were a total of 16 answers, 45 comments, and 11 recommended edits.

The question had a total score of 491, whereas the best answer had a score of

395, and the best comment had a score of 49.

 Author metadata:

o Activity score: 8,590 actions.

o Reputation: 351,765.

o Geographical location: California, United States.

o Number of profile views: 51,630.

107 Stemming consists of reducing a given word to its root form.


 Other:

o High score: Yes.

o Code snippet: No.

o Number of URLs: 2.

Out of the 18 conditions listed, a total of 16 were met, for an 88.89% combined score. The algorithm determined that the answer selected by the asker and the community was indeed the best/optimal one.

The remaining two case studies are presented in tabular form (see Table 4-9).


Table 4-9. Case studies in API and JavaScript

Conditions / Case Studies Case Study 2 – API108 Case Study 3 - JavaScript109

Totals 12 / 18 = 66.67% 14 / 18 = 77.78%

Answer categories Low: 3 High: 2

Question categories Low: 1 High: 1

BOW score 10 20

Character count 206 671

Word count 40 116

Capitalization ratio 3 / 206 = 1.46% 19 / 671 = 2.83%

Words after using NLTK 16 23

Grammar and spelling scores 10; 10 21; 11

Day of the week Saturday Tuesday

Time of the day Afternoon Morning

Comment and edit metadata 1; 0 10; 0

Author activity score 234 1,204

Reputation 343 18,775

Geographical location Singapore, Singapore Klagenfurt, Austria

Number of profile views 252 1,125

High score No Yes

Code snippet No Yes

Number of URLs 2 2

Note. Underlined cell values mean that other answers within the thread met optimal criteria. Both cumulative percentages are consistent with the expected results and post quality. For instance, even though both answers were chosen by the asker, the lower scores in the API solution are correctly reflected on the lesser total percentage.

108 Source: https://StackOverflow.com/questions/5595334/paypal-integration.

109 Source: https://StackOverflow.com/questions/12797118/how-can-i-declare-optional-function- parameters-in-javascript.


As shown by the implementation and analysis of the three case studies, the proposed research product could increase the speed of knowledge transfer within Stack Overflow, which could in turn optimize the platform's searching mechanisms.


Chapter 5: Discussion and Conclusions

5.1 Discussion

Stack Overflow meshes (or “plug-and-plays”) two important techniques. First, it

uses a Q&A format without following a traditional discussion forum paradigm; second,

its signature layout resembles that of Wiki-based platforms in that moderators are chosen

by the user community and interactions depend on collaborative trust. Pivotal to that

communal trust are reputational scores, a metric that becomes relevant because all

developers within the site are regarded as peers. Stack Overflow adopts the role of a lingua franca, a bridge connecting developers.

Stack Overflow, however, reveals a major paradox, contrasting the interests and goals of individuals with those of the community. Such interests and goals are not always aligned and, in some instances, are mutually exclusive. For instance, since only the asker can mark an answer as accepted, questions often go longer without being marked as resolved.

5.2 Conclusions

All three research hypotheses were confirmed. Firstly, PLS analysis was able to predict high-scoring answers with a 73.96% accuracy. Secondly, chosen answers could be correctly predicted 80.91% of the time via BLR. Thirdly, the three case studies completed in the previous chapter showed encouraging results and opened the possibility for further research and possible automation of the research product. These outcomes


could be useful for optimizing search inquiries110 for the millions of developers who

access Stack Overflow every day.

The leading factors that prevented the improvement of prediction accuracy percentages were the massive amount of duplicate information within the site, the variability of answerers’ expertise on the subject, and the responders’ competence in the use of the English language. Moreover, there are many questions that are not duplicates; however, they are similar enough to cause confusion among users and negatively affect post quality. Such drawbacks limit Stack Overflow’s ability to attract a greater number of high-quality posts.

As shown by Stack Overflow, proper knowledge transfer in the area of software development defies and possibly defeats the law of economic scarcity. A technically savvy, cross-trained team is more productive than a team pervaded by silos of expertise.

Listening to the Stack Overflow community could inform business processes, company planning, and even stock valuations that are heavily influenced by emerging programming techniques or breakthroughs. Without a doubt, adequate knowledge transfer must be a vital component of every IT department. When properly applied, knowledge transfer and intellectual capital can become an industry differentiator, a driver for innovation and a means of achieving competitive advantage (Bock and Kim 2002).

Stack Overflow can guide this innovation from the bottom-up (i.e., from developers to managers to executives).

110 This can be accomplished by translating scoring mechanisms into sitemaps and SEO (search engine optimization) within Stack Overflow.


5.3 Contributions to Body of Knowledge

This praxis contributes to the body of knowledge in two important ways:

1. By introducing the definition of optimal answer region, the nature and outcome

of the statistical analysis are more holistic than previous research conducted on

this area.

2. By building upon existing knowledge management techniques and knowledge

transfer methodologies, the proposed research product facilitates finding optimal

answers within Stack Overflow, with approximately 80% prediction accuracy.

5.4 Recommendations for Future Research

The amount and quality of the information found in Stack Overflow could inspire future research in several areas. A few recommendations are outlined below:

 Implementation and automation of conceptual frameworks related to knowledge

transfer, knowledge diffusion, and talent acquisition

 Analysis of online forum’s dynamics from a sociological perspective

Of primary importance, the conceptual framework developed here could be built into a configurable software tool for increased predictive power for locating high-scoring answers within Stack Overflow. Hence, researchers could expand the methodologies used in this praxis by scaling the dataset and using more up-to-date information. Also, they could implement a software tool to automate the processes of data collection, cleaning, and analysis.


Second, a case study could be conducted to evaluate the effectiveness of

implementing Stack Overflow for Teams111 in a development setting, especially if the knowledge managed is sensitive, privileged, or confidential. Stack Overflow for Teams is a private, cloud-based, searchable, knowledge management platform with a centralized source-of-truth that seamlessly integrates all capabilities from the Stack Overflow environment into companies. The information is handled reliably and securely for little over $200 per user per annum, driving internal team collaboration and lowering overhead costs. Furthermore, chatting tools like Slack can be easily integrated as part of Stack

Overflow for Teams to track article updates and notify users when new questions and answers become available. According to claims by Stack Overflow, employees using the platform can save up to 20 hours per month; internal support requests are resolved 20% faster, and, most importantly, solutions are stored for a future consult in a knowledge base format. In fact, companies like Expensify112 have successfully implemented Stack

Overflow for Teams.

Third, research could be conducted in Stack Overflow at the user level to

determine areas of expertise and areas requiring additional training for a given developer.

For instance, one could design an m × n knowledge matrix for finding subject area

experts. Rows would be represented by the letter m, and they would contain the names or

aliases of all engineers. Columns would be represented by the letter n, and they would

111 Source: https://StackOverflow.com/teams.

112 Expensify is a software company offering expense management systems for individuals and companies.


A Likert scale ranging from zero to five would be used (zero being no knowledge in a given area, and five being the highest attainable expertise). The knowledge matrix could also include contact details and availability, and it would work similarly to yellow pages (Delen and Al-Hawamdeh 2009). A good methodology for analyzing these data could be cluster analysis, which groups data objects into clusters through unsupervised classification, with the objective of finding high intra-class similarity and low inter-class similarity while detecting hidden patterns. A minimal sketch follows.
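The sketch below clusters a hypothetical four-engineer knowledge matrix with scikit-learn's KMeans. The engineer names, skill columns, scores, and the choice of k-means itself are assumptions for illustration; the praxis does not prescribe a particular clustering library or algorithm.

# Minimal sketch: clustering a hypothetical m x n knowledge matrix
# (engineers x technologies, Likert scores 0-5) to find groups of
# engineers with similar expertise profiles. All names and scores
# below are illustrative, not data from this praxis.
import numpy as np
from sklearn.cluster import KMeans

engineers = ["alice", "bob", "carol", "dave"]   # the m rows
skills = ["C#", "SQL", "Python", "JavaScript"]  # the n columns
matrix = np.array([
    [5, 4, 1, 0],
    [5, 3, 0, 1],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(matrix)
for name, label in zip(engineers, kmeans.labels_):
    print(f"{name}: cluster {label}")

Engineers landing in the same cluster would share similar expertise profiles, which is precisely the yellow-pages lookup described above.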

Fourth, an analysis of Stack Overflow's data could be invaluable to Human Resources (HR) departments and hiring/recruiting agencies. Hiring good programmers is an extremely challenging and tedious task, made so by a candidate-driven market in which a high number of open positions (demand) is met by a low number of available, reliable developers (supply). By researching technological and industry trends found within the site, technical recruiters could determine which tools are in greater demand (e.g., most questions from June 2019 are related to the Rust programming language) and find knowledgeable users within a geographical area for immediate screening.114 HR teams could also use these data to guide their staffing initiatives and fine-tune their onboarding programs to decrease turnover rates, as sketched below. Indeed, with over 50 million developers visiting Stack Overflow every month, the platform represents an excellent recruiting115 alternative via the Careers 2.0 initiative.
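A hedged sketch of the trend analysis described above follows. The file name, column names, and the "<tag1><tag2>" tag format are assumptions based on the SEDE schema documented in Appendix A, Table A-3.

# Minimal sketch: monthly question counts per tag from a hypothetical
# SEDE export "questions.csv" with CreationDate and Tags columns.
import pandas as pd

posts = pd.read_csv("questions.csv", parse_dates=["CreationDate"])

# Split the "<tag1><tag2>" format into one row per tag.
tags = (posts.assign(Tag=posts["Tags"].str.findall(r"<([^>]+)>"))
             .explode("Tag"))

monthly = (tags.groupby([tags["CreationDate"].dt.to_period("M"), "Tag"])
               .size()
               .rename("Questions"))

# Top five tags for a given month, e.g., June 2019:
june_2019 = monthly.loc[pd.Period("2019-06", freq="M")]
print(june_2019.nlargest(5))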

113 It would be useful to add a dictionary with detailed descriptions for each topic or tool.

114 This process can also be regarded as recruitment advertising and/or branding.



Fifth, sociological studies could be conducted to evaluate the factors leading to site popularity, user reputation, and community interaction within online forums. Stack Overflow is part of a bigger platform called Stack Exchange, which expands its customer base by using the Area 51 approach, a method in which approximately two hundred core users in a given discipline are recruited and surveyed with the goal of branching out into a new site targeting specialized users (e.g., chefs or video editors). On average, only 10% of all feedback is actionable. Studying social dynamics within individual sites, and how those sites relate to each other, could help in understanding the information flows that guide how online communities grow and evolve. Similarly, researchers could determine which personality traits are prevalent in Stack Overflow, thereby assessing the level of diversity within the community and the types of role models users seek (i.e., the factors driving peer recognition).

115 Sources: https://www.StackOverflowBusiness.com/talent/platform, https://StackOverflow.blog/2011/02/23/careers-2-0-launches/, and https://StackOverflow.com/jobs/get-started. Careers 2.0 recommends portfolio-based over résumé-based screening.


References

Abdalkareem, Rabe, Emad Shihab, and Juergen Rilling. 2017. "What Do Developers Use the Crowd For? A Study Using Stack Overflow." IEEE Software 34 (2): 53-60.

Akshatha, S., and S. Senthil Ganesh. 2017. "Exploring the Determinants of Exit Experience: Results from the Survey of Ex-Employees in India." International Conference on Communication and Signal Processing, Coimbatore.

Alawneh, Ali Ahmad, and Rashad Aouf. 2016. "A Proposed Knowledge Management Framework for Boosting the Success of Information Systems Projects." IEEE Engineering and MIS (ICEMIS) 1-5.

Amiryany, Nima, and Jeanne W. Ross. 2014. "Acquisitions that Make your Company Smarter." MIT Sloan Management Review 55 (2): 13. https://SloanReview.Mit.edu/article/acquisitions-that-make-your-company-smarter/.

Attiaoui, Dorra, Arnaud Martin, and Boutheina Ben Yaghlane. 2017. "Belief Measure of Expertise for Experts Detection in Question Answering Communities: Case Study Stack Overflow." Procedia Computer Science 112: 622-631.

Babcock, Pamela. 2004. "Shedding Light on Knowledge Management." HR Magazine 49 (5): 46-51. https://www.Shrm.org/hr-today/news/hr-magazine/pages/0504covstory.aspx.

Benabou, Charles, and Raphaël Benabou. 1999. "Establishing a Formal Mentoring Program for Organization Success." National Productivity Review: The Journal of Organizational Excellence 19 (4): 7-14.

Bock, Gee W., and Young-Gul Kim. 2002. "Breaking the Myths of Rewards: An Exploratory Study of Attitudes about Knowledge Sharing." Information Resource Management Journal 15 (2): 14-21.

Bureau of Labor Statistics, United States Department of Labor. 2012. "Employee Tenure." https://www.Bls.gov/news.release/archives/tenure_09182012.pdf.

Calefato, Fabio, Filippo Lanubile, and Nicole Novielli. 2018. "How to Ask for Technical Help? Evidence-Based Guidelines for Writing Questions on Stack Overflow." Information and Software Technology 94: 186-207. http://Dx.Doi.org/10.1016/j.infsof.2017.10.009.

Cancialosi, Chris. 2017. "Six Key Steps to Influencing Effective Knowledge Transfer in your Business." Accessed November 22, 2018. https://www.Forbes.com/sites/chriscancialosi/2014/12/08/6-key-steps-to-influencing-effective-knowledge-transfer-in-your-business/#77b27e945fe6.

Cao, Yin, Yuntao Ye, Hongli Zhao, Yunzhong Jiang, Hao Wang, Yizi Shang, and Junfeng Wang. 2018. "Remote Sensing of Water Quality Based on HJ-1A HSI Imagery with Modified Discrete Binary Particle Swarm Optimization-Partial Least Squares (MDBPSO-PLS) in Inland Waters: A Case in Weishan Lake." Ecological Informatics 44: 21-32.

Chang, Young Bong, and Vijay Gurbaxani. 2012. "Information Technology Outsourcing, Knowledge Transfer, and Firm Productivity: An Empirical Analysis." MIS Quarterly: Management Information Systems 36 (4): 1043-1063.

Changchit, Chuleeporn. 2003. "An Investigation Into the Feasibility of Using an Internet-Based Intelligent System to Facilitate Knowledge Transfer." Journal of Computer Information Systems 43 (4): 91-99.

Chesbrough, Henry W. 2003. "A Better Way to Innovate." Harvard Business Review 81 (7): 12-13.

Chua, Alton Y. K., and Snehasish Banerjee. 2015. "Answers or No Answers: Studying Question Answerability in Stack Overflow." Journal of Information Science 41 (5): 720-731.

Davenport, Thomas H. 1997. "Ten Principles of Knowledge Management and Four Case Studies." Knowledge and Process Management 4 (3): 187-208.

Delen, Dursun, and Suliman Al-Hawamdeh. 2009. "A Holistic Framework for Knowledge Discovery and Management." Communications of the ACM 52 (6): 141-145.

DeMeyer, Arnoud. 1991. "Tech Talk: How Managers are Stimulating Global R and D Communication." Massachusetts Institute of Technology, Sloan Management Review 32 (3): 49-58.

DeSimone, L. D., George N. Hatsopoulos, William F. O'Brien, Bill Harris, and Charles P. Holt. 1995. "How Can Big Companies Keep the Entrepreneurial Spirit Alive?" Harvard Business Review 73 (6): 183-189.

Dillon, Peter C., William K. Graham, and Andrea L. Aidells. 1972. "Brainstorming on a Hot Problem: Effects of Training and Practice on Individual and Group Performance." Journal of Applied Psychology 56 (6): 487-490.

Garbade, Michael J. 2018. "A Simple Introduction to Natural Language Processing." Accessed September 6, 2019. https://BecomingHuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32.

Geiger, Adrianne H. 1994. "Measures for Mentors." The Training and Development Sourcebook 46 (2): 65-67.

Gist, Marilyn E., Catherine Schwoerer, and Benson Rosen. 1989. "Effects of Alternative Training Methods on Self-Efficacy and Performance in Computer Software Training." Journal of Applied Psychology 74 (6): 884-891.

Guo, Yuchen, Guiguang Ding, Yuqi Wang, and Xiaoming Jin. 2016. "Active Learning with Cross-Class Knowledge Transfer." AAAI, Beijing, 1624-1630.

He, Ming, Jiuling Zhang, and Jiang Zhang. 2017. "MINDTL: Multiple Incomplete Domains Transfer Learning for Information Recommendation." China Communications 14 (11): 218-236. doi:10.1109/CC.2017.8233662.

He, Ming, Jiuling Zhang, Peng Yang, and Kaisheng Yao. 2018. "Robust Transfer Learning for Cross-Domain Collaborative Filtering Using Multiple Rating Patterns Approximation." In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 225-233. ACM.

Ihrig, Martin, and Ian MacMillan. 2015. "Managing your Mission-Critical Knowledge." Harvard Business Review 93 (1).

Iyengar, Kishen, Jeffrey R. Sweeney, and Ramiro Montealegre. 2015. "Information Technology Use as a Learning Mechanism: The Impact of IT Use on Knowledge Transfer Effectiveness, Absorptive Capacity, and Franchisee Performance." MIS Quarterly: Management Information Systems 39 (3): 615-641.

Jacobs, Susan. 2018. "Developing a Strategy to Facilitate Knowledge Transfer." Accessed November 22, 2018. https://www.LearningSolutionsMag.com/articles/developing-a-strategy-to-facilitate-knowledge-transfer.

Jiang, Wenhao, Wei Liu, and Fu-Lai Chung. 2018. "Knowledge Transfer for Spectral Clustering." Pattern Recognition 81: 484-496. https://Doi.org/10.1016/j.patcog.2018.04.018.

Joshi, Kshiti D., Saonee Sarker, and Suprateek Sarker. 2007. "Knowledge Transfer Within Information Systems Development Teams: Examining the Role of Knowledge Source Attributes." Decision Support Systems 43 (2): 322-335.

Kalpic, Brane. 2008. "Why Bigger Is Not Always Better: The Strategic Logic of Value Creation Through Mergers and Acquisitions." Journal of Business Strategy 29 (6): 4-13. Accessed January 22, 2018. https://Search-ProQuest-com.ProxyGw.wRlc.org/docview/202682959.

Khandelwal, Vijay K., and Petter Gottschalk. 2003. "Information Technology Support for Interorganizational Knowledge Transfer: An Empirical Study of Law Firms in Norway and Australia." Information Resources Management Journal 16 (1): 14-23.

Kim, Junghwan, Jaeki Song, and Donald R. Jones. 2011. "The Cognitive Selection Framework for Knowledge Acquisition Strategies in Virtual Communities." International Journal of Information Management 31 (2): 111-120.

Ko, Dong-Gil. 2010. "Consultant Competence Trust Does Not Pay Off, But Benevolent Trust Does! Managing Knowledge with Care." Journal of Knowledge Management 14 (2): 202-213.

Koman, Gabriel, and Jana Kundrikova. 2016. "Application of Big Data Technology in Knowledge Transfer Process Between Business and Academia." Procedia Economics and Finance, 605-611.

Kowalik, Grzegorz, and Radoslaw Nielek. 2016. "Senior Programmers: Characteristics of Elderly Users from Stack Overflow." International Conference on Social Informatics, 87-96. Cham: Springer.

Laudon, Kenneth C., and Jane P. Laudon. 2006. "Managing Knowledge in the Digital Firm." In Management Information Systems: Managing the Digital Firm, 563-618. Pearson.

Masudur-Rahman, Mohammad, and Chanchal K. Roy. 2018. "An Insight into the Unresolved Questions at Stack Overflow." arXiv.

Matei, Sorin Adam, Amani Abu Jabal, and Elisa Bertino. 2018. "Social-Collaborative Determinants of Content Quality in Online Knowledge Production Systems: Comparing Wikipedia and Stack Overflow." Social Network Analysis and Mining 8 (1): 36.

Murase, Kimiya, and Kunihito Kato. 2018. "Near-IR Material Discrimination Method by Using Multidimensional Response Variables PLS Regression Analysis." In 2018 International Workshop on Advanced Image Technology (IWAIT), 1-4. IEEE.

National Science Foundation. 2017. "Chapter 8. Invention, Knowledge Transfer, and Innovation." Accessed November 22, 2018. https://www.Nsf.gov/statistics/2018/nsb20181/report/sections/invention-knowledge-transfer-and-innovation/knowledge-transfer.

Nowlan, Michael F., and M. Brian Blake. 2007. "Agent-Mediated Knowledge Sharing for Intelligent Services Management." Information Systems Frontiers 9 (4): 411-421. doi:10.1007/s10796-007-9043-6.

Oliveira, Nigini, Michael Muller, Nazareno Andrade, and Katharina Reinecke. 2018. "The Exchange in Stack Exchange: Divergences Between Stack Overflow and its Culturally Diverse Participants." Proceedings of the ACM on Human-Computer Interaction, 1-22. https://Doi.org/10.1145/3274399.

Quick MBA. 2010. "Open Innovation - Porter's Generic Strategies." Accessed December 10, 2018. http://www.QuickMba.com/entre/open-innovation/.

Ragkhitwetsagul, Chaiyong, Jens Krinke, and Rocco Oliveto. 2017. "Awareness and Experience of Developers to Outdated and License-Violating Code on Stack Overflow: An Online Survey." arXiv preprint arXiv:1806.08149v1 [cs.SE]. UCL Computer Science Research Notes.

Raytheon Professional Services LLC. 2012. "Onboarding and Knowledge Transfer." Training Industry, Incorporated. https://TrainingIndustry.com/content/uploads/2017/07/onboarding-and-knowledge-transfer-report.pdf.

Roberts, Joanne. 2000. "From Know-How to Show-How? Questioning the Role of Information Communication Technologies in Knowledge Transfer." Technology Analysis and Strategic Management 12 (4): 429-433. doi:10.1080/713698499.

Rodríguez-Pérez, Raquel, Luis Fernández, and Santiago Marco. 2018. "Overoptimism in Cross-Validation When Using Partial Least Squares-Discriminant Analysis for Omics Data: A Systematic Study." Analytical and Bioanalytical Chemistry 410 (23): 5981-5992.

Rohrbach, Marcus, Michael Stark, and Bernt Schiele. 2011. "Evaluating Knowledge Transfer and Zero-Shot Learning in a Large-Scale Setting." In Computer Vision and Pattern Recognition (CVPR) - IEEE Conference, 1641-1648.

Roy, Kunal, Supratik Kar, and Rudra Narayan Das. 2015. "Selected Statistical Methods in QSAR." Chap. 6 in Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment. Academic Press.

Santhanam, Radhika, Larry Seligman, and David Kang. 2007. "Post-Implementation Knowledge Transfers to Users and Information Technology Professionals." Edited by Gatton College of Business and Economics, University of Kentucky School of Management. Journal of Management Information Systems 24 (1): 171-199. https://www.Jstor.org/stable/40398886.

Sarker, Saonee, Suprateek Sarker, Darren B. Nicholson, and Kshiti D. Joshi. 2005. "Knowledge Transfer in Virtual Systems Development Teams: An Exploratory Study of Four Key Enablers." IEEE Transactions on Professional Communication 48 (2): 201-218. doi:10.1109/TPC.2005.849650.

Scandura, Terri A. 1992. "Mentoring and Career Mobility: An Empirical Investigation." Journal of Organizational Behavior 13 (2): 169-174.

Schaufeld, Jerry. 2015. In Commercializing Innovation: Turning Technology Breakthroughs Into Products, 166. New York: Apress.

Shahbandi, Saeed Gholami, Martin Magnusson, and Karl Iagnemma. 2018. "Nonlinear Optimization of Multimodal Two-Dimensional Map Alignment with Application to Prior Knowledge Transfer." IEEE Robotics and Automation Letters 3 (3): 2040-2047.

Simon, Herbert Alexander. 1971. "Designing Organizations for an Information-Rich World." In Computers, Communication, and the Public Interest, edited by Martin Greenberger. Baltimore: The Johns Hopkins Press.

Simon, Steven J., Varun Grover, James T. C. Teng, and Kathleen Whitcomb. 1996. "The Relationship of Information System Training Methods and Cognitive Ability to End-User Satisfaction, Comprehension, and Skill Transfer: A Longitudinal Field Study." Information Systems Research 7 (4): 466-490.

Slaughter, Sandra A., and Laurie J. Kirsch. 2006. "The Effectiveness of Knowledge Transfer Portfolios in Software Process Improvement: A Field Study." Information Systems Research 17 (3): 301-320. https://Doi.org/10.1287/isre.1060.0098.

Starcevich, Matt, and Fred Friend. 1999. "Effective Mentoring Relationships from the Mentee's Perspective." Workforce 2-3.

Sun, Hao, Zhe Zhang, Babatunde Shittu Olasege, Zhong Xu, Qingbo Zhao, Peipei Ma, Qishan Wang, and Yuchun Pan. 2019. "Application of Partial Least Squares in Exploring the Genome Selection Signatures Between Populations." Heredity 122 (3): 288-293.

Swap, Walter, Dorothy Leonard, Mimi Shields, and Lisa Abrams. 2001. "Using Mentoring and Storytelling to Transfer Knowledge in the Workplace." Journal of Management Information Systems 18 (1): 95-114. doi:10.1080/07421222.2001.11045668.

Szulanski, Gabriel, Dimo Ringov, and Robert J. Jensen. 2016. "Overcoming Stickiness: How the Timing of Knowledge Transfer Methods Affects Transfer Difficulty." Organization Science 27 (2): 304-322. https://Doi.org/10.1287/orsc.2016.1049.

Vasilescu, Bogdan, Vladimir Filkov, and Alexander Serebrenik. 2013. "Stack Overflow and GitHub: Associations Between Software Development and Crowdsourced Knowledge." Social Computing (SocialCom) - IEEE, 188-195.

Wang, Bo, Ning Zhang, Quan Lin, Songcan Chen, and Yuhua Li. 2010. "Semantic-Oriented Knowledge Transfer for Review Rating." Tsinghua Science and Technology 15 (6): 633-641.

Yang, Di, Aftab Hussain, and Cristina Videira Lopes. 2016. "From Query to Usable Code: An Analysis of Stack Overflow Code Snippets." ACM, 391-402. doi:10.1145/2901739.2901767.

Yin, Cao, Ye Yuntao, and Zhao Hongli. 2018. "Satellite Hyperspectral Retrieval of Turbidity for Water Source Based on Discrete Particle Swarm and Partial Least Squares." Transactions of the Chinese Society for Agricultural Machinery 49 (1): 173-182.

Zhang, Jingxuan, He Jiang, Zhilei Ren, and Xin Chen. 2018. "Recommending APIs for API Related Questions in Stack Overflow." IEEE Access 6: 6205-6219.

Zhang, Michael J. 2007. "An Empirical Assessment of the Performance Impacts of Information Systems Support for Knowledge Transfer." Edited by Murray E. Jennex. International Journal of Knowledge Management (Idea Group Publishing) 3 (1): 66-85.

Zhang, Xinmin, Manabu Kano, Masahiro Tani, Junichi Mori, Junji Ise, and Kohhei Harada. 2018. "Hurdle Modeling for Defect Data with Excess Zeros in Steel Manufacturing Process." IFAC-PapersOnLine 51 (18): 375-380.

Zhang, Yun, David Lo, Xin Xia, and Jian-Ling Sun. 2015. "Multi-Factor Duplicate Question Detection in Stack Overflow." Journal of Computer Science and Technology 30 (5): 981-997. doi:10.1007/s11390-015-1576-4.


Appendix A

Table A-1. SEDE’s sample data for the First Hypothesis

Row | Answer_Score (DV) | Question_Score (IV) | Question_Number_Of_Views (IV) | Answer_Text_Length (IV) | Author_Views (IV) | Comment_Sum (IV) | Suggested_Edits_Count (IV)
1 | 3,264 | 2,421 | 360,255 | 1,593 | 21,365 | 870 | 114
2 | 1,127 | 792 | 218,788 | 11,362 | 3,645 | 435 | 115
3 | 1,123 | 651 | 184,312 | 783 | 132 | 360 | 75
4 | 863 | 479 | 229,164 | 531 | 406 | 360 | 116
5 | 844 | 241 | 65,835 | 446 | 30 | 75 | 18
… | … | … | … | … | … | … | …
1,481 | -4 | 17 | 16,319 | 429 | 256 | 14 | 4
1,482 | -5 | 24 | 2,089 | 797 | 132 | 3 | 3
1,483 | -7 | 11 | 16,199 | 159 | 32 | 24 | 12
1,484 | -7 | 2 | 600 | 2,172 | 71 | 4 | 11
1,485 | -8 | -3 | 127 | 519 | 83 | 21 | 15

Note. DV means dependent variable (Y-axis). IV means independent variable (X-axis containing confirmed predictors). The table displays the first five and the last five records collected. All data is shown before transformation.
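For readers who wish to reproduce this kind of analysis, the following minimal Python sketch fits a regression to the Table A-1 variables. The file name is an assumption, and ordinary least squares is shown purely as an illustration, not as the exact estimation procedure used in this praxis.

# Minimal sketch, assuming Table A-1 has been exported to a CSV file
# named "hypothesis1.csv" with the column names shown above.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("hypothesis1.csv")
X = sm.add_constant(data[["Question_Score", "Question_Number_Of_Views",
                          "Answer_Text_Length", "Author_Views",
                          "Comment_Sum", "Suggested_Edits_Count"]])
y = data["Answer_Score"]

model = sm.OLS(y, X).fit()  # ordinary least squares, for illustration
print(model.summary())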


Table A-2. SEDE’s sample data for the Second Hypothesis

Row | Answer_Accepted (DV) | Answer_Score (IV) | Bags_Of_Words_Score (IV) | Tag_Sum (IV) | Comment_Sum (IV) | Suggested_Edits_Count (IV)
1 | 1 | 3,264 | 24 | 107,427,957 | 870 | 114
2 | 1 | 1,127 | 17 | 38,454,827 | 435 | 115
3 | 1 | 1,123 | 43 | 5,668,860 | 360 | 75
4 | 1 | 863 | 22 | 93,960 | 360 | 116
5 | 1 | 844 | 7 | 7,718,232 | 75 | 18
… | … | … | … | … | … | …
1,481 | 1 | -4 | 63 | 4,038,158 | 14 | 4
1,482 | 1 | -5 | 32 | 1,275,204 | 3 | 3
1,483 | 0 | -7 | 18 | 5,841,447 | 24 | 12
1,484 | 1 | -7 | 106 | 3,401,541 | 4 | 11
1,485 | 1 | -8 | 22 | 9,834,340 | 21 | 15

Note. DV means dependent variable (Y-axis). IV means independent variable (X-axis containing confirmed predictors). The table displays the first five and the last five records collected. All data is shown before transformation.
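Because the dependent variable in Table A-2 (Answer_Accepted) is binary, a logistic model is the natural analogue of the sketch above. Again, the file name is an assumption and the model is illustrative only.

# Minimal sketch, assuming Table A-2 has been exported to a CSV file
# named "hypothesis2.csv" with the column names shown above.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("hypothesis2.csv")
X = sm.add_constant(data[["Answer_Score", "Bags_Of_Words_Score",
                          "Tag_Sum", "Comment_Sum",
                          "Suggested_Edits_Count"]])
y = data["Answer_Accepted"]

model = sm.Logit(y, X).fit()  # logistic regression, for illustration
print(model.summary())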


Table A-3. List of SEDE terms

Table Name | Column | Data Type (Precision) | Description

Badges116 | Id | INT (10) | PK.
Badges | UserId | INT (10) | FK to Users.
Badges | Name | NVARCHAR (50) | Name of the badge.
Badges | Date | DATETIME | yyyy-MM-dd hh:mm:ss.fff.117
Badges | Class | TINYINT (3) | Gold (1), silver (2), or bronze (3).
Badges | TagBased | BIT | True for tagged badges; false for named badges.

116 See Appendix B, Table B-1, Queries #2 and #3.

117 All timestamps use UTC (Coordinated Universal Time or Greenwich Mean Time). To make a conversion to EST (Eastern Standard Time), use Query #4. To list all time zones, use Query #5.

CloseAsOffTopicReasonTypes | Id | SMALLINT (5) | PK.
CloseAsOffTopicReasonTypes | IsUniversal | BIT | -
CloseAsOffTopicReasonTypes | MarkdownMini | NVARCHAR (500) | Close reason's markdown.
CloseAsOffTopicReasonTypes | CreationDate | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | CreationModeratorId (optional) | INT (10) | FK to Users.
CloseAsOffTopicReasonTypes | ApprovalDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | ApprovalModeratorId (optional) | INT (10) | FK to Users.
CloseAsOffTopicReasonTypes | DeactivationDate (optional) | DATETIME | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | DeactivationModeratorId (optional) | INT (10) | FK to Users.

CloseReasonTypes | Id | TINYINT (3) | PK.
CloseReasonTypes | Name | NVARCHAR (200) | Current: 101 = Duplicate; 102 = Off-topic; 103 = Unclear what you are asking; 104 = Too broad; 105 = Primarily opinion-based. Deprecated: 1 = Exact duplicate; 2 = Off-topic; 3 = Not constructive, subjective, or argumentative; 4 = Not a real question; 7 = Too localized; 10 = General reference; 20 = Noise or pointless.
CloseReasonTypes | Description (optional) | NVARCHAR (500) | Detailed description.

Comments | Id | INT (10) | PK.
Comments | PostId | INT (10) | FK to Posts.
Comments | Score | INT (10) | Relevance index. There is no voting feature for comments; hence, they have no impact on a user's reputation.
Comments | Text | NVARCHAR (600) | Comment body. Only users having a reputational score of 50 or higher can enter comments.
Comments | CreationDate | DATETIME (3) | When the comment was created.
Comments | UserDisplayName (optional) | NVARCHAR (30) | Author's display name or anonymous.
Comments | UserId (optional) | INT (10) | FK to Users. The user account might not exist.

FlagTypes | Id | TINYINT (3) | PK.
FlagTypes | Name | NVARCHAR (50) | Question recommended close (13), question close (14), and question reopen (15).
FlagTypes | Description | NVARCHAR (500) | A user without close privileges suggests a question should be closed. A user with close privileges is voting to reopen a question. A user with close privileges is voting to close a question.

PendingFlags | Id | INT (10) | PK.
PendingFlags | FlagTypeId | TINYINT (3) | FK to FlagTypes.
PendingFlags | PostId | INT (10) | FK to Posts.
PendingFlags | CreationDate (optional) | DATE | Record creation date.
PendingFlags | CloseReasonTypeId (optional) | TINYINT (3) | FK to CloseReasonTypes.
PendingFlags | CloseAsOffTopicReasonTypeId (optional) | SMALLINT (5) | FK to CloseAsOffTopicReasonTypes, but only when the close reason is off-topic.
PendingFlags | DuplicateOfQuestionId (optional) | INT (10) | FK to Posts for old or current duplicates.
PendingFlags | BelongsOnBaseHostAddress (optional) | NVARCHAR (100) | Votes to close and migrate.

PostFeedback | Id | INT (10) | PK. This table stores positive and negative votes from non-registered users.
PostFeedback | PostId | INT (10) | FK to Posts.
PostFeedback | IsAnonymous (optional) | BIT | Anonymous or unregistered users with no reputation.
PostFeedback | VoteTypeId | TINYINT (3) | FK to VoteTypes.
PostFeedback | CreationDate | DATETIME (3) | Record creation date.

PostHistory | Id | INT (10) | PK.
PostHistory | PostHistoryTypeId | TINYINT (3) | FK to PostHistoryTypes.
PostHistory | PostId | INT (10) | FK to Posts.
PostHistory | RevisionGUID | UNIQUEIDENTIFIER | Groups multiple historical records that occurred in a single action.
PostHistory | CreationDate | DATETIME (3) | Record creation date.
PostHistory | UserId (optional) | INT (10) | FK to Users.
PostHistory | UserDisplayName (optional) | NVARCHAR (40) | Only visible when the User ID is not available.
PostHistory | Comment (optional) | NVARCHAR (400) | Comments entered by users editing the post. When the value equals 10, the close reason will be visible. When the value equals 33 or 34, the Post Notice ID will be visible.
PostHistory | Text (optional) | NVARCHAR (-1) | Revision's content: JSON string with all users who voted when Post History Type ID equals 10, 11, 12, 13, 14, 15, 19, 20, or 35; JSON string with all Original Question IDs if it is a duplicate close vote; and, if the ID equals 17, it will contain migration metadata.

PostHistoryTypes | Id | TINYINT (3) | PK: 1 = Initial title (questions only); 2 = Initial body; 3 = Initial tags (questions only); 4 = Edit title (questions only); 5 = Edit body (raw markdown); 6 = Edit tags (questions only); 7 = Rollback title (questions only); 8 = Rollback body (raw markdown); 9 = Rollback tags (questions only); 10 = Post closed; 11 = Post reopened; 12 = Post deleted; 13 = Post undeleted or restored; 14 = Post locked by moderator; 15 = Post unlocked by moderator; 16 = Community-owned; 17 = Post migrated (replaced by codes 35 and 36, which stand for away and here); 18 = Question merged with deleted question; 19 = Question protected by moderator; 20 = Question unprotected by moderator; 21 = Post disassociated by administrator; 22 = Question unmerged (metadata was restored to a previously merged question); 24 = Suggested edit applied; 25 = Post tweeted; 31 = Comment discussion moved to chat; 33 = Post notice added (FK to PostNotices); 34 = Post notice removed; 35 = Post migrated away (refer to code 17); 36 = Post migrated here (refer to code 17); 37 = Post merge source; 38 = Post merge destination; 50 = Community bump. Deprecated: 23 = Unknown development-related event; 26 = Vote nullification by developer; 27 = Post unmigrated or hidden by a moderator; 28 = Unknown suggestion event; 29 = Unknown moderator event (possibly due to dewikification); 30 = Unknown event.
PostHistoryTypes | Name | NVARCHAR (50) | Name associated with the ID, as outlined above.

PostLinks | Id | INT (10) | PK.
PostLinks | CreationDate | DATETIME (3) | Record creation date.
PostLinks | PostId | INT (10) | FK to Posts (original or source).
PostLinks | RelatedPostId | INT (10) | FK to Posts (related or target).
PostLinks | LinkTypeId | TINYINT (3) | Type of link (in reference to Related Post ID): 1 = Linked; 3 = Duplicate.

PostNotices | Id | INT (10) | PK.
PostNotices | PostId | INT (10) | FK to Posts.
PostNotices | PostNoticeTypeId (optional) | INT (10) | FK to PostNoticeTypes.
PostNotices | CreationDate | DATETIME (3) | Record creation date.
PostNotices | DeletionDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
PostNotices | ExpiryDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
PostNotices | Body (optional) | NVARCHAR (-1) | Custom text that accompanies the notice.
PostNotices | OwnerUserId (optional) | INT (10) | FK to Users.
PostNotices | DeletionUserId (optional) | INT (10) | FK to Users.

PostNoticeTypes | Id | INT (10) | PK.
PostNoticeTypes | ClassId | TINYINT (3) | 1 = Historical lock; 2 = Bounty; 4 = Moderator notice.
PostNoticeTypes | Name (optional) | NVARCHAR (80) | Name of the notice type.
PostNoticeTypes | Body (optional) | NVARCHAR (-1) | Default notice content.
PostNoticeTypes | IsHidden | BIT | Whether the notice type is visible or not.
PostNoticeTypes | Predefined | BIT | Whether the notice type is a constant or not.
PostNoticeTypes | PostNoticeDurationId | INT (10) | -1 = No duration specified; 1 = Seven days for bounties.

Posts | Id | INT (10) | PK. This table stores all active, non-deleted posts.
Posts | PostTypeId | TINYINT (3) | FK to PostTypes.
Posts | AcceptedAnswerId (optional) | INT (10) | Only visible when the Post Type ID equals one (1).
Posts | ParentId (optional) | INT (10) | Only visible when the Post Type ID equals two (2).
Posts | CreationDate | DATETIME (3) | Record creation date.
Posts | DeletionDate (optional) | DATETIME (3) | Available only when there is a PostsWithDeleted record.
Posts | Score | INT (10) | Relevance score.
Posts | ViewCount (optional) | INT (10) | The number of times the post has been viewed.
Posts | Body (optional) | NVARCHAR (-1) | Displayed as rendered (not markdown) HTML.
Posts | OwnerUserId (optional) | INT (10) | FK to Users. Only visible when the user is active. Wiki entries are owned by the community and have a value of -1.
Posts | OwnerDisplayName (optional) | NVARCHAR (40) | User's friendly name.
Posts | LastEditorUserId (optional) | INT (10) | Last user who edited the post.
Posts | LastEditorDisplayName (optional) | NVARCHAR (40) | User's friendly name.
Posts | LastEditDate (optional) | DATETIME (3) | Most recent edit date/time.
Posts | LastActivityDate (optional) | DATETIME (3) | Most recent activity date/time.
Posts | Title (optional) | NVARCHAR (250) | Post's title.
Posts | Tags (optional) | NVARCHAR (250) | Tags or keywords used. Each post can have a maximum of five tags, considering it might be related to several subjects.
Posts | AnswerCount (optional) | INT (10) | Number of answers entered.
Posts | CommentCount (optional) | INT (10) | Number of comments entered.
Posts | FavoriteCount (optional) | INT (10) | The number of times the post has been favorited.
Posts | ClosedDate (optional) | DATETIME (3) | Visible when the post is closed.
Posts | CommunityOwnedDate (optional) | DATETIME (3) | Visible when the post belongs to a community wiki.

PostsWithDeleted | Id | INT (10) | PK. This table's schema duplicates Posts and stores all deleted posts.
PostsWithDeleted | PostTypeId | TINYINT (3) | FK to PostTypes.
PostsWithDeleted | AcceptedAnswerId (optional) | INT (10) | Not populated.
PostsWithDeleted | ParentId (optional) | INT (10) | Only visible when the Post Type ID equals 2.
PostsWithDeleted | CreationDate | DATETIME (3) | Record creation date.
PostsWithDeleted | DeletionDate (optional) | DATETIME (3) | Available only when there is a PostsWithDeleted record.
PostsWithDeleted | Score | INT (10) | Relevance score.
PostsWithDeleted | ViewCount (optional) | INT (10) | Not populated.
PostsWithDeleted | Body (optional) | NVARCHAR (-1) | Not populated.
PostsWithDeleted | OwnerUserId (optional) | INT (10) | Not populated.
PostsWithDeleted | OwnerDisplayName (optional) | NVARCHAR (40) | Not populated.
PostsWithDeleted | LastEditorUserId (optional) | INT (10) | Not populated.
PostsWithDeleted | LastEditorDisplayName (optional) | NVARCHAR (40) | Not populated.
PostsWithDeleted | LastEditDate (optional) | DATETIME (3) | Not populated.
PostsWithDeleted | LastActivityDate (optional) | DATETIME (3) | Not populated.
PostsWithDeleted | Title (optional) | NVARCHAR (250) | Not populated.
PostsWithDeleted | Tags (optional) | NVARCHAR (250) | Tags or keywords used.
PostsWithDeleted | AnswerCount (optional) | INT (10) | Not populated.
PostsWithDeleted | CommentCount (optional) | INT (10) | Not populated.
PostsWithDeleted | FavoriteCount (optional) | INT (10) | Not populated.
PostsWithDeleted | ClosedDate (optional) | DATETIME (3) | Visible when the post is closed.
PostsWithDeleted | CommunityOwnedDate (optional) | DATETIME (3) | Not populated.

PostTags | PostId | INT (10) | FK to Posts.
PostTags | TagId | INT (10) | FK to Tags.

PostTypes | Id | TINYINT (3) | PK: 1 = Question; 2 = Answer; 3 = Orphaned tag wiki; 4 = Tag wiki excerpt; 5 = Tag wiki; 6 = Moderator nomination; 7 = Wiki placeholder (or election description); 8 = Privileged wiki.
PostTypes | Name | NVARCHAR (50) | Name of the post type.

ReviewRejectionReasons | Id | TINYINT (3) | PK: 1 = Radical change; 2 = Vandalism; 3 = Too minor; 4 = Invalid edit; 5 = Copied content; 6 = Wiki not helpful; 7 = Excerpt not helpful; 8 = Suggested edit conflict; 9 = Critical issues; 101 = Spam or vandalism; 102 = No improvement whatsoever; 103 = Irrelevant tag; 104 = Clearly conflicts with author's intent; 105 = Attempt to reply; 106 = Copied content; 107 = Lacks usage guidance; 108 = Suggested edit conflict; 109 = Critical issues; 110 = Circular tag definition.
ReviewRejectionReasons | Name | NVARCHAR (100) | Review rejection reason's name.
ReviewRejectionReasons | Description | NVARCHAR (300) | A detailed description of the review rejection reason.
ReviewRejectionReasons | PostTypeId (optional) | TINYINT (3) | FK to PostTypes, only when the Post Types ID equals Wiki not helpful (6) or Excerpt not helpful (7).

ReviewTaskResults | Id | INT (10) | PK.
ReviewTaskResults | ReviewTaskId | INT (10) | FK to ReviewTasks.
ReviewTaskResults | ReviewTaskResultTypeId | TINYINT (3) | FK to ReviewTaskResultTypes.
ReviewTaskResults | CreationDate (optional) | DATE (0) | Record creation date.
ReviewTaskResults | RejectionReasonId (optional) | TINYINT (3) | FK to ReviewRejectionReasons for suggested edits.
ReviewTaskResults | Comment (optional) | NVARCHAR (150) | Any clarifying comments.

ReviewTaskResultTypes | Id | TINYINT (3) | PK: 1 = Not sure; 2 = Approve for suggested edits; 3 = Reject for suggested edits; 4 = Delete for low-quality posts; 5 = Edit for first-time (low-quality posts and late answers); 6 = Close for low-quality posts; 7 = Looks good for high-quality posts; 8 = Do not close; 9 = Recommend deletion for low-quality answers; 10 = Recommend close for low-quality questions; 11 = I am done (related to first posts); 12 = Reopen; 13 = Leave closed; 14 = Edit and reopen; 15 = Excellent by community evaluation; 16 = Satisfactory by community evaluation; 17 = Needs improvement by community evaluation; 18 = No action needed (related to first posts and late answers); 19 = Reject and edit; 20 = Should be improved; 21 = Unsalvageable.
ReviewTaskResultTypes | Name | NVARCHAR (100) | Name of the review task result type.
ReviewTaskResultTypes | Description | NVARCHAR (300) | A detailed description of the review task result type.

ReviewTasks | Id | INT (10) | PK.
ReviewTasks | ReviewTaskTypeId | TINYINT (3) | FK to ReviewTaskTypes.
ReviewTasks | CreationDate (optional) | DATE (0) | Record creation date.
ReviewTasks | DeletionDate (optional) | DATE (0) | Record deletion date.
ReviewTasks | ReviewTaskStateId | TINYINT (3) | FK to ReviewTaskStates.
ReviewTasks | PostId | INT (10) | FK to Posts.
ReviewTasks | SuggestedEditId (optional) | INT (10) | FK to SuggestedEdits. It contains internal numbering for historicity.
ReviewTasks | CompletedByReviewTaskId (optional) | INT (10) | FK to ReviewTaskResults for the outcome of a completed review.

ReviewTaskStates | Id | TINYINT (3) | PK: 1 = Active (the review task is still in the queue); 2 = Completed (the review task was completed, so it is no longer in the queue); 3 = Invalidated (the task was dequeued naturally; a post might be edited to achieve higher quality, or close votes might expire, all of which is completely separate from the review dashboard).
ReviewTaskStates | Name | NVARCHAR (50) | Name of the review task state.
ReviewTaskStates | Description | NVARCHAR (300) | A detailed description of the review task state.

ReviewTaskTypes | Id | TINYINT (3) | PK: 1 = Suggested edit; 2 = Close votes; 3 = Low-quality posts; 4 = First post; 5 = Late answer; 6 = Reopen vote; 7 = Community evaluation; 10 = Triage; 11 = Helper. Deprecated: 8 = Link validation; 9 = Flagged posts.
ReviewTaskTypes | Name | NVARCHAR (50) | Name of the review task type.
ReviewTaskTypes | Description | NVARCHAR (300) | A detailed description of the review task type.

SuggestedEdits | Id | INT (10) | PK. This record is visible (under review) when both approval and rejection dates are blank and the ReviewTasks record holds an active state.
SuggestedEdits | PostId | INT (10) | FK to Posts.
SuggestedEdits | CreationDate (optional) | DATETIME (3) | Record creation date.
SuggestedEdits | ApprovalDate (optional) | DATETIME (3) | Record approval date; only visible for approved edits.
SuggestedEdits | RejectionDate (optional) | DATETIME (3) | Record rejection date; only visible for rejected edits.
SuggestedEdits | OwnerUserId (optional) | INT (10) | FK to Users.
SuggestedEdits | Comment (optional) | NVARCHAR (800) | Additional user comment.
SuggestedEdits | Text (optional) | NVARCHAR (-1) | Edit content.
SuggestedEdits | Title (optional) | NVARCHAR (250) | Edit title.
SuggestedEdits | Tags (optional) | NVARCHAR (250) | Edit tags.
SuggestedEdits | RevisionGUID (optional) | UNIQUEIDENTIFIER | Groups multiple edits.

SuggestedEditVotes | Id | INT (10) | PK.
SuggestedEditVotes | SuggestedEditId | INT (10) | FK to SuggestedEdits.
SuggestedEditVotes | UserId | INT (10) | FK to Users (the person who is making the suggestion).
SuggestedEditVotes | VoteTypeId | TINYINT (3) | FK to VoteTypes.
SuggestedEditVotes | CreationDate | DATETIME (3) | Record creation date.
SuggestedEditVotes | TargetUserId (optional) | INT (10) | FK to Users (the person who is the target of the suggestion).
SuggestedEditVotes | TargetRepChange | INT (10) | Targeted area to be edited.

Tags | Id | INT (10) | PK.
Tags | TagName (optional) | NVARCHAR (35) | Name of the tag.
Tags | Count | INT (10) | The number of times it has been listed.
Tags | ExcerptPostId (optional) | INT (10) | FK to Posts.
Tags | WikiPostId (optional) | INT (10) | FK to Posts.

TagSynonyms | Id | INT (10) | PK.
TagSynonyms | SourceTagName (optional) | NVARCHAR (35) | FK to Tags (origin).
TagSynonyms | TargetTagName (optional) | NVARCHAR (35) | FK to Tags (destination).
TagSynonyms | CreationDate | DATETIME (3) | Record creation date.
TagSynonyms | OwnerUserId | INT (10) | FK to Users.
TagSynonyms | AutoRenameCount | INT (10) | The number of times the tag has been autocorrected.
TagSynonyms | LastAutoRename (optional) | DATETIME (3) | Last time the tag was autocorrected.
TagSynonyms | Score | INT (10) | Relevance score.
TagSynonyms | ApprovedByUserId (optional) | INT (10) | FK to Users.
TagSynonyms | ApprovalDate (optional) | DATETIME (3) | Record approval date.

Users | Id | INT (10) | PK.
Users | Reputation | INT (10) | User reliability or estimated measure of community trust toward a user. A maximum of 200 points can be gained per diem: +5 for a question upvote; +10 for an answer upvote; +15 for an accepted/chosen answer. Higher reputational scores will unlock privileges and enable access to additional features. Stack Overflow values the community's feedback over the asker's feedback.
Users | CreationDate | DATETIME (3) | Record creation date.
Users | DisplayName (optional) | NVARCHAR (40) | User-friendly name.
Users | LastAccessDate | DATETIME (3) | Last date/time when a user loaded or viewed a page. At best, this is updated every thirty minutes.
Users | WebsiteUrl (optional) | NVARCHAR (200) | User's personal website.
Users | Location (optional) | NVARCHAR (100) | City or region where the user claims to reside.
Users | AboutMe (optional) | NVARCHAR (-1) | Biographical information or role description.
Users | Views | INT (10) | The number of times the profile has been viewed (popularity).
Users | UpVotes | INT (10) | The number of positive votes received (favorable popularity).
Users | DownVotes | INT (10) | The number of negative votes received (detrimental popularity).
Users | ProfileImageUrl (optional) | NVARCHAR (200) | Link to the user's picture, avatar, or logo.
Users | EmailHash (optional) | VARCHAR (32) | Deprecated field.
Users | AccountId (optional) | INT (10) | Stack Exchange Network's Profile ID.

Votes | Id | INT (10) | PK.
Votes | PostId | INT (10) | FK to Posts.
Votes | VoteTypeId | TINYINT (3) | FK to VoteTypes.
Votes | UserId (optional) | INT (10) | FK to Users. Only visible when the Vote Type ID equals 5 (favorited) or 8 (bountied). The Vote Type ID will equal -1 for deleted users.
Votes | CreationDate (optional) | DATETIME (3) | The record creation date is hidden to protect the user's privacy.
Votes | BountyAmount (optional) | INT (10) | Only visible when the Vote Type ID equals 8 or 9; i.e., the bounty was started or closed.

VoteTypes | Id | TINYINT (3) | PK: 1 = Accepted by originator; 2 = UpMod (approve/positive/upvote); 3 = DownMod (reject/negative/downvote); 4 = Offensive; 5 = Favorite (includes the User ID); 6 = Close (creates a PostHistory record as of June 2013); 7 = Reopen; 8 = Bounty start (populates User ID and bounty amount); 9 = Bounty close (populates a bounty amount); 10 = Deletion; 11 = Undeletion; 12 = Spam; 15 = Moderator review (the moderator has viewed the post after the post was flagged for moderator's attention); 16 = Approve edit suggestion (a user voted to approve a suggested edit; to be approved, most posts require several user votes or a single, binding moderator vote).
VoteTypes | Name | NVARCHAR (50) | Name of the vote type.

Appendix B

Table B-1. Query list118

Query #1.

SELECT CAST(AVG([Score]) AS DECIMAL(10, 2)) FROM [Posts];

Query #2.

SELECT
    c.[Table_Name],
    CASE
        WHEN ##PK:STRING?id## = c.[Column_Name]
            THEN CONCAT(c.[Column_Name], ' (PK)')
        ELSE c.[Column_Name]
    END AS Column_Name,
    [Data_Type],
    [Is_Nullable],
    COALESCE(CHARACTER_MAXIMUM_LENGTH, Numeric_Precision, DateTime_Precision)
        AS [Length / Precision]
FROM [Information_Schema].[Columns] c
WHERE c.[Table_Name] = ##TABLE:STRING?Posts##
ORDER BY [Ordinal_Position] ASC;

118 All queries listed here were created by the Stack Exchange community and are publicly available. Source: https://Data.StackExchange.com/stackoverflow/queries.


Query #3.

SELECT
    'query://874190/?table=' + [Table_Name] + '|' + [Table_Name]
        AS [Table (Links to Sample Data)], -- Sample data subquery.
    [Ordinal_Position] AS [Field Number],
    [Column_Name] AS [Column],
    CASE
        WHEN [Data_Type] LIKE '%VARCHAR%'
            THEN [Data_Type] + '(' + CAST(Character_Maximum_Length AS VARCHAR) + ')'
        WHEN [Data_Type] LIKE '%INT%'
            THEN [Data_Type] + '(' + CAST(Numeric_Precision AS VARCHAR) + ')'
        ELSE [Data_Type]
    END AS [Type],
    [Table_Schema],            -- dbo.
    [Column_Default],          -- Nullable.
    [DateTime_Precision],      -- It has a value of three for the DateTime data type.
    [Numeric_Precision_Radix], -- Only visible for TinyInt, SmallInt, and Int data types.
    [Numeric_Scale]            -- Nullable.
FROM [Information_Schema].[Columns]
--WHERE
--    [Table_Name] LIKE '%Types%' -- Only applicable to static referential tables.
ORDER BY [Table_Name] ASC, [Ordinal_Position] ASC;

Query #4.

SELECT TOP (1)
    [CreationDate] AS [CreationDate],
    DATEADD(HOUR, -4, [CreationDate]) AS [CreationDateEST],
    GETDATE() AS [CurrentDate],
    DATEADD(HOUR, -4, GETDATE()) AS [CurrentDateEST]
FROM [Posts];

Query #5.

SELECT * FROM [Sys].[Time_Zone_Info];
