The Curious Case of Posts on Stack Overflow
Shailja Shukla
Subject: Information Systems
Corresponds to: 30 hp
Presented: VT 2020
Supervisor: Mudassir Imran Mustafa
Department of Informatics and Media

Contents

Abstract
Acknowledgements
Chapter 1
1. Introduction
    1.1. Background
    1.2. Motivation
    1.2 Research Questions
    1.3 Delimitation:
    1.4 Limitation:
Chapter 2
2. Theory
    2.1 Topic Modelling:
    2.2 Latent Dirichlet Allocation (LDA):
    2.3 Related Work
Chapter 3
3. Methodology:
    3.1 Data Collection:
    3.2 Data Extraction:
        3.2.1 Schema:
    3.3 Data Pre-processing:
        3.1.1 Subset corpus data:
        3.1.2 Remove code snippets:
        3.3.3 Combine related documents to form a single corpus:
        3.3.4 Tokenization:
        3.3.5 Lowercasing:
        3.3.6 Remove punctuations:
        3.3.7 Text Standardization/Replace Contractions:
        3.3.8 Remove stop words:
        3.3.9 Remove URLs:
        3.3.10 Minimum size words:
        3.3.11 Remove multiple whitespaces:
        3.3.12 Generate N-Grams:
        3.3.13 Stemming:
        3.3.14 Lemmatisation:
    3.4 Create Dictionary and Term Document Frequency:
    3.5 Run the LDA model:
Chapter 4
4 Analysis:
Chapter 5
5 Result
    5.1 RQ1- What are the popular discussion topics in Stack Overflow?
        5.1.1 Web as a recurring discussion topic:
        5.1.2 UI Development as a recurring discussion topic:
        5.1.3 Data management as a recurring discussion topic:
    5.2 RQ2- How does the developer's interest change over time?
    5.3 RQ3- How do the interests in specific technologies change over time?
        5.3.1 React vs Angular
        5.3.2 Python vs JavaScript
        5.3.3. Popular discussion topics related to Web technologies
        5.3.4 Relational Databases (RDBMS)
        5.3.5 Android vs iOS
        5.3.6 Object-Oriented Programming
        5.3.7 Machine Learning
Chapter 6
6 Validity of research and experiences:
Chapter 7
7 Conclusion:
Chapter 8
8 Discussion & Future Work:
Appendix 1: Tools and technology
Appendix 2: Popular discussion topics lists among developers:
Appendix 3: Acronym / Abbreviation Table
References:

Table of Figures:

Figure 1: Venn Diagram of the intersection of the Text Mining and six related fields (Miner et al., 2012)
Figure 2: Schematic Overview of LDA (Debortoli et al., 2016).
Figure 3: Methodology Model
Figure 4: Sample user post before cleaning of code snippet from the text content.
Figure 5: Sample user post after cleaning of code snippet from the text content.
Figure 6: Title of sample user post
Figure 7: Body of sample user post
Figure 8: Combined title and body of sample user post text
Figure 9: Sample text before pre-processing
Figure 10: Sample text after partial pre-processing
Figure 11: Sample text before stemming and lemmatisation
Figure 12: Sample text after stemming and lemmatisation