Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow
by Raul Quintana Selleras
B.S. in Information Technology, August 2012, Florida International University
B.A. in Religious Studies, December 2012, Florida International University
M.S. in Information Systems, May 2015, The University of Texas at Arlington
A Praxis submitted to
The Faculty of The School of Engineering and Applied Science of The George Washington University in partial fulfillment of the requirements for the degree of Doctor of Engineering
January 10, 2020
Praxis directed by
Timothy Blackburn
Professorial Lecturer of Engineering Management and Systems Engineering

Amir Etemadi
Associate Professor of Engineering and Applied Science

The School of Engineering and Applied Science of The George Washington University
certifies that Raul Quintana Selleras has passed the Final Examination for the degree of
Doctor of Engineering as of October 15, 2019. This is the final and approved form of the
Praxis.
Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow
Raul Quintana Selleras
Praxis Research Committee:
Timothy Blackburn, Professorial Lecturer of Engineering Management and Systems Engineering, Praxis Co-Director
Amir Etemadi, Associate Professor of Engineering and Applied Science, Praxis Co-Director
Ebrahim Malalla, Visiting Associate Professor of Engineering and Applied Science, Committee Member
© Copyright 2019 by Raul Quintana Selleras All rights reserved
Dedication
The author wishes to dedicate this dissertation to his daughter, Alexia Quintana, and to his wife, Kristina Quintana, for their unconditional support. Also, the author would like to thank his parents, Raul Quintana Sarduy and Gilda Selleras Rivas, whose encouragement was vital to his educational accomplishments.
Acknowledgments
The author wishes to acknowledge his praxis director, Dr. Timothy Blackburn; his editor, Peter Rosenbaum; and all faculty and staff from the Doctor of Engineering program, as well as the students from the seventh cohort.
The author thanks Andrew Rothman and Lucas Longan for their insightful suggestions.
Abstract of Praxis
Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow
With the advent of knowledge-based economies, knowledge transfer within online
forums has become increasingly important to the work of IT teams. Stack Overflow, for example, is an online community in which computer programmers can interact and consult with one another to achieve information flow efficiencies and bolster their
reputations, which are numerical representations of their standings within the platform.
The high volume of information available in Stack Overflow in the context of significant
variance in members’ expertise and, hence, the quality of their posts hinders knowledge
transfer and causes developers to waste valuable time locating good answers.
Additionally, invalid answers can introduce security vulnerabilities and/or legal risks.
By conducting text analytics and regression, this research presents a predictive model to optimize knowledge transfer among software developers. This model incorporates the identification of factors (e.g., good tagging, answer character count, tag frequency) that reliably lead to high-scoring answers in Stack Overflow. Upon applying natural language processing, the following variables were found to be significant: (a) the number of answers per question, (b) the cumulative tag score, (c) the cumulative
comment score, and (d) the bags-of-words frequency. Additional methods were used to
identify the factors that contribute to an answer being selected by the user who posted the
question, the community at large, or both.
Predicting what constitutes a good, accurate answer helps not only developers but also Stack Overflow itself, as the site can redesign its user interface to make better use of its knowledge repository and transfer knowledge more effectively. Likewise, companies that use the platform can decrease the amount of time and resources invested in training, fix software bugs faster, and complete challenging projects in a timely fashion.
Table of Contents
Dedication ...... iv
Acknowledgments ...... v
Abstract of Praxis ...... vi
List of Figures ...... x
List of Tables ...... xii
List of Symbols / Nomenclature ...... xiii
Glossary of Terms ...... xiv
Chapter 1: Introduction ...... 1
1.1 Background ...... 1
1.2 Research Motivation ...... 5
1.3 Problem Statement ...... 6
1.4 Thesis Statement ...... 8
1.5 Research Objectives ...... 10
1.6 Research Questions and Hypotheses ...... 12
1.7 Scope of Research ...... 14
1.8 Research Limitations ...... 14
1.9 Organization of Praxis ...... 15
Chapter 2: Literature Review ...... 17
2.1 Introduction ...... 17
2.2 Information, Knowledge, and Related Concepts ...... 18
2.3 Digging into Stack Overflow ...... 21
2.4 Knowledge Transfer ...... 24
2.5 Online Forums ...... 31
2.6 Summary and Conclusions ...... 33
Chapter 3: Methodology ...... 39
3.1 Introduction ...... 39
3.2 Data Collection and Analysis ...... 43
3.3 Research Methods ...... 47
Chapter 4: Results ...... 57
4.1 Introduction ...... 57
4.2 Data Collection and Preprocessing ...... 59
4.3 Predictive Models ...... 67
4.4 Case Studies ...... 80
Chapter 5: Discussion and Conclusions ...... 86
5.1 Discussion ...... 86
5.2 Conclusions ...... 86
5.3 Contributions to Body of Knowledge ...... 88
5.4 Recommendations for Future Research ...... 88
References ...... 92
Appendix A ...... 102
Appendix B ...... 134
List of Figures
Figure 1-1. Stack Overflow question...... 4
Figure 1-2. Stack Overflow answer...... 4
Figure 1-3. Stack Overflow and optimal answer region...... 11
Figure 2-1. Interest graph...... 20
Figure 3-1. Delen’s holistic framework for knowledge management discovery...... 40
Figure 3-2. Supportability analysis...... 43
Figure 3-3. Relationship between critical data fields collected...... 45
Figure 3-4. Research product...... 49
Figure 3-5. Data preprocessing...... 51
Figure 3-6. Text analytics...... 52
Figure 3-7. ProWritingAid toolbar in Microsoft Office 365 ProPlus...... 53
Figure 3-8. ProWritingAid summary...... 53
Figure 3-9. Post metadata...... 54
Figure 3-10. Author metadata...... 55
Figure 4-1. Stack Overflow users’ age, educational level, and wake-up time...... 58
Figure 4-2. Most popular technologies in Stack Overflow...... 59
Figure 4-3. Query used for data collection in SEDE...... 60
Figure 4-4. SEDE’s partial results...... 61
Figure 4-5. Query’s partial execution plan...... 61
Figure 4-6. C-sharp functionality for calculating BOW scores...... 66
Figure 4-7. Linking data source and quantitative methods...... 67
Figure 4-8. Post frequency distribution per week...... 69
Figure 4-9. PLS standard coefficient plot for answer score...... 73
Figure 4-10. PLS loading plot for answer score...... 73
Figure 4-11. PLS response plot for answer score...... 74
Figure 4-12. Relationship model between research hypotheses and case studies...... 80
List of Tables
Table 3-1. Mapping Delen’s Holistic Framework into Stack Overflow ...... 41
Table 4-1. Sample data for a single post ...... 62
Table 4-2. Analysis of Variance ...... 74
Table 4-3. Model Selection and Validation ...... 75
Table 4-4. Coefficients ...... 75
Table 4-5. Deviance table ...... 78
Table 4-6. Model Summary ...... 78
Table 4-7. Coefficients ...... 78
Table 4-8. Goodness-of-fit Tests ...... 79
Table 4-9. Case studies in API and JavaScript ...... 84
Table A-1. SEDE’s sample data for the First Hypothesis ...... 102
Table A-2. SEDE’s sample data for the Second Hypothesis ...... 103
Table A-3. List of SEDE terms ...... 104
Table B-1. Query list ...... 134
List of Symbols / Nomenclature
PK Primary Key
FK Foreign Key
Glossary of Terms
ANOVA Analysis of Variance
API Application Program Interface
BLR Binary Logistic Regression
BME Belief Measure of Expertise
BOW Bags of Words
BPSO Binary Particle Swarm Optimization
CICO Chief Intellectual Capital Officer
CINO Chief Innovation Officer
CKO Chief Knowledge Officer
CLO Chief Learning Officer
CPO Chief People Officer
CQA Community Question Answering
CSCW Computer-Supported Cooperative Work
CST Central Standard Time
CSV Comma-Separated Values
DoS Denial of Service
EST Eastern Standard Time
FST Fixation Index
GWASs Genome-Wide Association Studies
HR Human Resources
ICTs Information and Communication Technologies
IS Information Systems
IT Information Technology
KB Knowledge Base
KM, VP of Vice President of Knowledge Management
LDA Latent Dirichlet Allocation
M&As Mergers and Acquisitions
MIMIC Multiple Indicators and Multiple Causes
NER Named Entity Recognition
NIR Near-Infrared Ray
NLP Natural Language Processing
NLTK Natural Language Toolkit
PLS Partial Least Squares
SEDE Stack Exchange Data Explorer
SEO Search Engine Optimization
SO Stack Overflow
SOA Service-Oriented Architecture
SPI Software Process Improvement
SQL Structured Query Language
TSC Transfer Spectral Clustering
URL Uniform Resource Locator
UTF Unicode Transformation Format
UX User Experience
VSD Value-Sensitive Design
Chapter 1: Introduction
1.1 Background
Over the past two decades, economies have shifted from being production-based to being knowledge-based, and knowledge management software and services have become multi-billion-dollar industries (Kim, Song and Jones 2011; Roberts 2000; Sarker et al. 2005; Schaufeld 2015). A report from the National Science Board highlights the importance of knowledge transfer in the creation of successful patents and active licensing deals and the major influence knowledge-based transfer has had on invention and innovation (National Science Foundation 2017). Conversely, as Babcock (2004) observed, failed knowledge transfer1 can translate into loss of property and life.
A professional knowledge worker is an individual who researches, produces, analyzes, and interprets information and ideas. Knowledge transfer is the dynamic, unnatural2 practice of transferring knowledge between people and departments, usually trying to organize, create, capture, and distribute intellectual content within an organization (National Science Foundation 2017). Knowledge transfer differs from traditional learning or education in that it is not usually conducted in a hierarchical fashion.
Knowledge transfer is particularly inefficient for software development teams
(Raytheon Professional Services LLC 2012). Many developers rely on Internet resources
1 Both in terms of absorption and transmission.
2 Knowledge acquisition and sharing is time-consuming and usually discouraged by management. Besides, knowledge hoarding is correlated with job security (Davenport 1997).
and online forums rather than on other members of their teams to obtain answers to their
technical questions or concerns. Stack Overflow attempts to streamline knowledge
transfer between technical users by opening a collaboration platform to address programming problems.
Stack Overflow is the world’s largest and most popular Q&A community (online forum) for technical users. It serves over 50 million computer programmers and software developers worldwide, 60% of whom work on back-end projects (Zhang 2018).
On average, each developer accesses the site a total of six times a month, spending at least fifteen seconds per visit. Additionally, Stack Overflow is the largest and most popular community within Stack Exchange, having the highest number of participants
and posts. As of June 2019, Stack Overflow had 11 million users and stored 18 million
questions and 27 million answers. Questions currently range over a wide variety of technical topics, such as programming languages, algorithms, APIs, disruptive
technologies, troubleshooting, configuration, and technical definitions. The success of the
Stack Overflow platform depends on users’ willingness to collaborate by asking and
answering each other’s questions (Calefato 2018). In terms of user participation, Stack
Overflow is an online knowledge production site controlled by small groups of
contributors3 (Matei 2018). Oliveira (2018) classifies users as experts4 or activists.5 Over 94% of participants contribute infrequently, and highly engaged contributors are extremely uncommon (Oliveira 2018).
3 Also known as elite stickiness.
4 Having low participation with high-quality posts.
5 Having high participation with low-quality posts.
Stack Overflow is an open community in which users have a say in how the site behaves, have access to the data from Stack Exchange, and experience a sense of ownership over the platform. This produces both network and feedback effects across the entire community. Chua (2015) defines Stack Overflow as a knowledge-sharing, community-owned, online platform that fits the definition of a Community Question Answering (CQA) site harnessing collective wisdom. Additionally, Stack Overflow qualifies as Computer-Supported Cooperative Work (CSCW).
Stack Overflow adheres to a sequential process. First, registered users post a
question (see Figure 1-1). At this juncture, questioners can enter a title, a description, tags,6 and code snippets regarding the problem they are trying to solve. Other registered
users start offering solutions to the question. The community can upvote/praise,
downvote/criticize, or comment on each answer, deciding which answer gets the highest
score. Downvotes and upvotes directly establish a given user’s reputation; i.e., the more upvotes a user receives, the better the user’s reputation within the community. Furthermore, only the questioner can choose a given answer as the preferred one; that answer then becomes accepted, and the question is thereby resolved (see Figure 1-2).
6 Buzzwords identifying the subject matter of the inquiry.
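The question, answer, and vote workflow just described can be summarized in a minimal data sketch. This is an illustration only, not Stack Overflow's actual data model; the class and field names are invented for the example, and reputation arithmetic is deliberately omitted.

```python
from dataclasses import dataclass, field

# Minimal model of the thread lifecycle described above.
# Class and field names are hypothetical, not Stack Overflow's schema.

@dataclass
class Answer:
    author: str
    body: str
    score: int = 0          # net result of community upvotes and downvotes
    accepted: bool = False  # set only by the questioner

@dataclass
class Question:
    title: str
    tags: list
    answers: list = field(default_factory=list)

    def accept(self, chosen: "Answer") -> None:
        # Only one answer per question can be marked accepted.
        for a in self.answers:
            a.accepted = a is chosen

q = Question("How do I parse JSON in C#?", ["c#", "json"])
a1 = Answer("alice", "Use a JSON serializer.")
a2 = Answer("bob", "Concatenate strings manually.")
q.answers += [a1, a2]
a1.score += 3   # three community upvotes
a2.score -= 1   # one community downvote
q.accept(a1)    # the questioner marks a1 as the accepted answer
```

Note that the community's score and the asker's acceptance are independent signals; that independence is precisely the distinction on which this praxis's definition of an optimal answer builds.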
Figure 1-1. Stack Overflow question.
Figure 1-2. Stack Overflow answer.
Successful questions (i.e., those that are answered) have certain attributes: They
are posed by reputable users, are short and clear, contain code snippets, do not abuse
uppercase characters, and adopt a neutral7 emotional style (Calefato 2018). To determine
the answerability of a question, researchers use a framework with the following features:
affect,8 presentation quality,9 post time, and reputation (Calefato 2018). Approximately
29% of questions on Stack Overflow do not receive answers that are accepted10 even
though many of them have one or more proposed solutions.
1.2 Research Motivation
Stack Overflow is one of the most popular websites among software developers
and programming enthusiasts. That fact, along with the naturally complex nature of
computer systems and the pervasiveness of programming bugs and hardware failures, is
the primary motivation for this praxis.
Finding an adequate solution to a programming problem is not a trivial issue.
Stack Overflow has limitations that can be overcome through better matching between
questioners and accurate answers. Determining what factors make a reliable, valid, and
high-scoring answer can have two major benefits. Firstly, Stack Overflow’s website can
7 Stack Overflow is not a discussion forum.
8 Using a method known as sentiment analysis or opinion mining. SentiStrength and Voyant are two popular sentiment analysis tools.
9 Considering the presence of URLs and the number of uppercase characters.
10 Collected from https://StackExchange.com/sites, after clicking on the Stack Overflow panel.
be redesigned to encourage users to tailor their solutions and meet specific standards.
Secondly, askers would have their concerns addressed more quickly and efficiently.
Stack Overflow can be a driving force for transferring knowledge within
development teams while increasing job retention and bolstering code quality. This praxis
seeks to optimize answer searchability and improve knowledge transfer among software
developers in Stack Overflow.
1.3 Problem Statement
Twenty-nine percent of questions posted to Stack Overflow never receive an accepted answer, and the accepted answers have a low average score,11 barely reaching a value of two.
Low scores tend to overshadow the quality of a solution and, therefore, prevent egalitarian (i.e., evenly distributed) information diffusion within the site and hinder knowledge transfer among software developers.
In general, the high volume of information in a knowledge base (Wang 2010) deters knowledge transfer, causing IT personnel to waste an average of 265 man-hours
(10-15%) annually (National Science Foundation 2017). Even if not all this time is spent browsing online forums or searching for answers in Stack Overflow, IT employees struggle to find reliable solutions to their pressing problems (Jacobs 2018).
11 See Appendix B, Table B-1, Query #1. Source: https://Data.StackExchange.com/stackoverflow/query/edit/988631. The average score is calculated as follows: Average Score = (Σ accepted-answer scores) / (total number of accepted answers). Even though there is considerable score variability from post to post, high-scoring answers tend to have scores in the low hundreds.
Similarly, according to the Panopto Workplace Knowledge and Productivity
Report, (a) large companies can lose up to $47 million in productivity each year, (b)
knowledge professionals can waste 5.3 hours weekly, and (c) 81% of workers feel
frustrated when they cannot get the information needed to perform their jobs. Also, 85%
of employees claim that knowledge transfer positively impacts productivity12 (Jacobs
Fortune 500 companies report losing over $31.5 billion a year, despite the huge investments they make in knowledge transfer programs (Babcock 2004). Indeed,
American companies invest over $100 billion in training, but less than 10% of this is invested in work-setting-oriented knowledge transfer support (Benabou and Benabo
1999). In general, some common barriers blocking knowledge transfer are the loss of job security, not wanting to be the bearer of bad news, distrusting coworkers and managers, fear of being ridiculed, and interference with one’s duties. The last two are the main inhibitors of online collaboration among developers.
IT departments rank third—behind customer service and sales—in terms of turnover attributable to limited knowledge transfer, which occurs predominantly during onboarding. Onboarding13 costs are extremely high. Raytheon has reported that the mean
cost of a single job turnover approaches $115,000, and it takes over 100 days on average to hire a software developer.
Additionally, only 39% of executives leading training programs believe their
organizations are either effective or somewhat effective at transferring knowledge.
12 Productivity is defined as completing a project within a fixed time frame (Sarker et al. 2005).
13 In this context, onboarding refers to the training processes new employees have to go through.
Knowledge transfer programs often lack consistency, organizational alignment, time,
budget, tools, and executive support (Raytheon Professional Services LLC 2012).
Babcock (2004, 2) asserted, “Some of the consulting firms’ systems got too much
knowledge put in them, and people got overwhelmed by the amount of [information] they
had to deal with.” The massive amount of technical documentation hinders its usage,
causing IT teams to stop reading training documents altogether. A famous Latin proverb
reads, “Aurum Vergilius de stercore colligit Enni” (Vergil must gather gold from
Ennius’s manure). When trying to find useful information, most IT professionals feel the
same way.
1.4 Thesis Statement
The development of a model that identifies factors leading to a high-scoring
answer using natural language processing (NLP) and regression analysis will streamline
current searching capabilities in Stack Overflow and facilitate knowledge transfer by
increasing developers’ chances of finding the answers they need.
Proper communication and documentation keep engineering teams more focused
and productive. According to DeMeyer (1991, 49), “The productivity of an R&D
engineer depends to a large extent on his or her ability to tap into an appropriate network
of information flows”; such networks decrease the number of software bugs and facilitate
deployments. For instance, knowledge professionals assisting large teams spend more than 13% of their time dealing with inefficient knowledge management processes
(Raytheon Professional Services LLC 2012). Indeed, the more data are available, the more complicated and expensive the logistical challenges of handling and managing such data become (Koman and Kundrikova 2016; Nowlan and Blake 2007). Moreover, a substantial share of all available documentation is inaccessible or simply lost in the clutter created by excessive data availability.
Compounding matters, the increase of mergers and acquisitions (M&As) within the IT industry and the resulting growth in the size of software-based companies is pointing toward a rapid industry concentration of knowledge and market heterogeneity, a trend that complicates the integration of knowledge repositories (Kalpic 2008). Indeed, exchanging written documents providing “visual anonymity and asynchronous interaction” entails arduous relationships and dealing with relational differences
(Szulanski, Ringov and Jensen 2016, 309).
Even though Stack Overflow has a growing repository currently storing millions of posts, it uses formalized standards to avoid data duplication and to remove irrelevant content. By using Stack Overflow to document solutions, engineering teams could cross-train their members and create a cloud-based backup to store company knowledge
(Cancialosi 2017). Stack Overflow Teams was created with a specific purpose—to offer an online repository with private access. Private, cloud-based repositories ensure that knowledge professionals can delegate training to individual departments for greater efficiency. Instead of relying on textual documentation that requires time-consuming and expensive processing, teams can also use multimedia-based, interactive tutorials to make knowledge transfer programs more entertaining and appealing. However, no software initiative can succeed without the backing of a formal process for managing and transferring knowledge. Hence, training must be embedded as a part of the engineering department’s culture and mission (Cancialosi 2017), and the chief information officer
should be held responsible for the training’s success. By using Stack Overflow to create a knowledge repository, companies can transfer knowledge more effectively, decrease the amount of time invested in training, lower the number of software bugs, and complete projects in less time.
1.5 Research Objectives
The main objective of this research is the prediction of what constitutes an optimal answer in Stack Overflow. By considering the askers’ and community’s feedback, along with the answerers’ reputations and other metadata from a thread’s post, the proposed research product is intended to forecast whether an answer is optimal. The research questions posed relate to the average scores of different answer types and to the variability in answerers’ reputations and geographical locations in determining how accurate an answer is. Other factors considered were the number of views, word count, relevance, and user reviews.
The best or optimal answer is defined as one that is selected independently by both the community and the asker. Indeed, only a small percentage of solutions are optimal. For example, if a given question received nine proposed solutions, the one with the highest score is the best answer according to the community. However unlikely, there might be multiple answers with the same score; nonetheless, only a single solution14 can be selected by the asker. To clarify, a working solution might not be the best one, as it could contain performance issues or might have missed special scenarios, such as boundary values or states.
14 It may contain multiple steps or stages, or different approaches for solving the same problem.
Both askers and the community can ignore answers or accept and rank solutions.
As depicted by Figure 1-3, an optimal community answer and an asker’s preferred answer need not be the same. Unanswered questions are not part of Figure 1-3, which simply intends to represent the optimal answer region in Stack Overflow. The optimal region is hereby defined as the overlap between the best answer according to the community (highest score) and the best answer according to the asker (personal preference).
Figure 1-3. Stack Overflow and optimal answer region.
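The optimality definition above reduces to a simple predicate: an answer is optimal when it is both the thread's top-scoring answer (the community's choice) and the answer the asker accepted (personal preference). The sketch below illustrates this; the field names (`score`, `is_accepted`) are hypothetical, not the actual Stack Exchange schema.

```python
# An answer is "optimal" when it lies in the overlap of Figure 1-3:
# it has the thread's highest score (community choice) AND it is the
# answer the asker accepted (personal preference).
# Field names are hypothetical, not the Stack Exchange schema.

def is_optimal(answer, thread_answers):
    top_score = max(a["score"] for a in thread_answers)
    return answer["is_accepted"] and answer["score"] == top_score

answers = [
    {"id": 1, "score": 12, "is_accepted": False},
    {"id": 2, "score": 25, "is_accepted": True},   # accepted and top-scoring
    {"id": 3, "score": 18, "is_accepted": False},
]

optimal_ids = [a["id"] for a in answers if is_optimal(a, answers)]
```

Had the asker instead accepted answer 1, no answer would satisfy the predicate, illustrating why only a small percentage of solutions fall in the optimal region.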
As a collateral objective, the present study seeks to optimize the way IT teams handle knowledge transfer processes via Stack Overflow. The proposed research product outlines a framework to predict high-scoring, relevant answers in Stack Overflow.
Achieving such an objective decreases the number of unused documents in internal knowledge bases and does away with long, useless manuals.
1.6 Research Questions and Hypotheses
This research answers instrumental research questions, as follows:
RQ1: What is the average score of accepted answers versus rejected ones? (comparative)

RQ2: What is the standard deviation of the authors’ (source) reputations? (descriptive)

RQ3: How frequently do question tags appear listed in the body of a chosen answer? (descriptive)

RQ4: What is the average number of comments per question and answer? (comparative)

RQ5: What is the range in the number of suggested edits per chosen answer? (descriptive)
The research hypotheses draw conclusions through a methodology applied to parameters of Stack Exchange usage, such that:
RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (fewer than 2,000 characters), then the answer will have a high relevance score.

RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the answer will more likely be selected as the asker’s preferred answer, thus becoming easier to find.
RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow’s high-scoring topics will be easier to find (search optimization).
The first research hypothesis will analyze community-driven, high-scoring answers to identify the factors driving relevance scores. For instance, it is hypothesized that posts with more than six comments will likely have a high score. The population’s average number of comments is three; hence, answers containing more than twice that number of comments (i.e., more than six) will meet the specified criterion. Additionally, high-quality posts will have fewer recommended edits than other posts within the thread and contain fewer than 2,000 characters.
Similarly, the second research hypothesis identifies the factors leading askers to select a given answer as their preferred one. The average cumulative tag score and bags-of-words score are 5,054,237 and 50, respectively. Therefore, scores higher than 10,108,474 and 100, respectively, would be accepted.
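The word-frequency signal in RH2 (word segmentation, bags of words, and stemming) can be sketched with a minimal pipeline. The praxis's actual pipeline relied on proper NLP tooling (see Chapter 3); the tokenizer regex and the crude suffix-stripping stand-in for a stemmer below are simplifying assumptions.

```python
import re
from collections import Counter

def crude_stem(token: str) -> str:
    # Naive suffix stripping as a stand-in for a real stemmer;
    # illustrative only, far weaker than e.g. a Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def bag_of_words(text: str) -> Counter:
    # Word segmentation -> stemming -> term-frequency counts.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(crude_stem(t) for t in tokens)

bow = bag_of_words("Answer answers; answered, answering!")
```

Here all four surface forms collapse to the stem `answer`, giving it a frequency of four. Per RH2, an answer whose bags-of-words score exceeds 100 (twice the average of 50) would meet the word-frequency criterion.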
The third hypothesis will utilize the factors identified in hypotheses 1 and 2 to build a conceptual framework and locate Stack Overflow answers faster. As mentioned
previously, answers have an average score of two. Any answer with a score greater than
five will be considered to have a high relevance score.
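Taken together, the thresholds above (more than six comments, fewer than 2,000 characters, a cumulative tag score above 10,108,474, a bags-of-words score above 100, and an answer score above five) can be expressed as screening predicates. The numeric cutoffs come directly from the text; the field names are hypothetical.

```python
# Screening predicates for RH1-RH3 using the thresholds stated above.
# Field names are invented for illustration; thresholds are from the text.

TAG_SCORE_CUTOFF = 10_108_474  # twice the average cumulative tag score (5,054,237)
BOW_SCORE_CUTOFF = 100         # twice the average bags-of-words score (50)

def meets_rh1(post: dict) -> bool:
    # High-relevance candidate: active discussion, few edits, succinct body.
    return (
        post["question_comments"] > 6  # more than twice the average of three
        and post["suggested_edits"] < post["fewest_edits_among_other_posts"]
        and post["char_count"] < 2000
    )

def meets_rh2(post: dict) -> bool:
    # Likely to be selected as the asker's preferred answer.
    return (
        post["cumulative_tag_score"] > TAG_SCORE_CUTOFF
        and post["bow_score"] > BOW_SCORE_CUTOFF
    )

def high_relevance_score(post: dict) -> bool:
    # RH3 working definition: the average answer score is two,
    # so a score greater than five counts as high relevance.
    return post["score"] > 5

sample = {
    "question_comments": 9,
    "suggested_edits": 1,
    "fewest_edits_among_other_posts": 3,
    "char_count": 850,
    "cumulative_tag_score": 12_000_000,
    "bow_score": 140,
    "score": 11,
}
```

A post such as `sample` would pass all three screens; in the praxis these criteria feed the conceptual framework of RH3 rather than acting as a standalone classifier.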
1.7 Scope of Research
The scope of the present research is twofold. First, it defines and manually tests a
research product that could allow future researchers to understand the factors leading to
high-scoring answers in Stack Overflow. Second, it recommends the implementation of
the proposed research product in a tool (e.g., Python or C-sharp) supporting text analytics
and statistical analysis. In this way, thousands of posts can be analyzed in a matter of
seconds.
1.8 Research Limitations
This research is limited in several ways. First, the data for the praxis were almost
entirely restricted to the year 2018. Stack Overflow’s maturity exceeds eleven years,15
and the quantity and quality of posts from the years following its inception might be
vastly different from the ones used here. Using a dataset from a different time period
would result in significantly different outcomes, considering that Stack Overflow was not
as popular and/or influential in 2008 as it is today.
15 Stack Overflow was founded in 2008 by Jeff Atwood and Joel Spolsky.
Second, since Stack Overflow is a highly technical forum that does not favor social interactions, this praxis’s conclusions will not be applicable to online forums at large, nor to junior or retired programmers or non-college graduates. With the goal of getting better, more reliable results, the site expects focused answers to problems that prevent developers from doing their jobs.16 Similarly, strict rules are enforced with a fair level of unspoken tension, making mid- and senior-level developers—and not beginners—Stack Overflow’s target audience. A related limitation is users’ bias and fallibility, as individual askers or even the community can select a suboptimal answer based on the way it is written (e.g., it is entertaining or easy to understand).
Third, the present praxis develops a framework that Stack Overflow’s users or administrators/moderators would use; however, it does not include specific implementation details of the software that would automate the process described in the framework.
1.9 Organization of Praxis
The current chapter introduces knowledge transfer and its limitations. It also explores Stack Overflow’s main features and intended audience. The second chapter offers a literature review covering knowledge transfer, defining the concept of information, describing online forums, and analyzing how methods used in related research informed the present praxis. The third chapter explores the methodological approach used—which involves text analytics, partial least squares (PLS), and binary
16 Also known as an impediment or blockage in the Agile/Scrum methodology.
logistic regression (BLR)—and explains how the data were collected and sanitized.
Chapter four summarizes the results of the research, and chapter five enumerates the
conclusions, contributions, and recommendations of the praxis. Sources and appendices are included at the end of the paper.
Chapter 2: Literature Review
2.1 Introduction
The definitions of information, knowledge, and knowledge absorption and
transfer have been developed extensively in the literature. Such concepts are pivotal to
understanding online forums and Q&A communities, including those catering to
technical users. Stack Overflow’s popularity is heavily dependent on information flows and trust between users.
Additionally, the platform’s user experience (UX) not only relies on availability and usability but also on the successful implementation of a gamification17 paradigm. The
use of reputational scores, badges, and bounties drives answerers to compete against each
other. A high reputation unlocks new privileges and grants access to moderation tools;
badges are considered special achievements and are issued at three levels: bronze, silver,
and gold; and bounties tend to redirect traffic to questions with high visibility, making
them look like missions.18 Therefore, the user offering the highest-scoring answer is
considered to be the thread’s winner, thereby garnering an elevated standing within the
community. Stack Overflow’s Leagues display a scoreboard with the highest ranked users by week, month, quarter, year, and all time.19
17 Gamification is the use of elements and techniques usually found in video games to make a product more appealing and engaging.
18 Sources: https://StackOverflow.com/tour and https://www.YouTube.com/channel/UC2hxYQtGLEkcOMK4h8JRycA/videos.
19 Source: https://StackExchange.com/leagues/1/alltime/stackoverflow. The highest-ranking Stack Overflow user is Jon Skeet, a Java and C-sharp software developer for Google, considered an authority on date and time algorithms. He has become almost a mythical figure within the Stack Overflow community, as all of his [almost 40,000] answers are upvoted and over 200 million visitors have seen his solutions, saving an estimated 56.8 million hours of development time. He was the first person to win over one million reputation points in Stack Overflow.

No research can be successful in determining practical ways the forum can be improved without a full understanding of the type of information Stack Overflow users expect and the format it follows. Moreover, it is important to assert that Stack Overflow’s true contribution is its ability to offer answers to complex coding questions and contribute to the success of programming projects. Software developers see the benefit in data availability but value data readiness even more.

2.2 Information, Knowledge, and Related Concepts

Information can be defined as a set of organized data patterns, but knowledge is intrinsically more permanent, a public good, non-excludable, multilayered, and often indivisible (Bock and Kim 2002; DeMeyer 1991; Roberts 2000). Information is made up of facts and symbols, whereas knowledge involves skills and expertise; hence, knowledge is interpreted, personalized, or contextualized information (Zhang 2007).

Knowledge modes are multi-dimensional and are defined as tacit, embodied,20 encoded,21 embrained,22 procedural,23 and embedded24 (Joshi, Sarker and Sarker 2007). Furthermore, knowledge processes involve the discovery, capture, sharing, and application of information (Alawneh 2016) for improving the success of IT projects, online collaboration, and the mitigation of risks.

20 Partially articulated.

21 Stored in data banks.

22 Ability to interpret underlying phenomenological patterns.

23 Process-based understanding.

24 Contextual, not pre-given.
Tacit25 knowledge represents an intuitive understanding grounded in expertise that is not only harder to articulate than explicit, formal, systematic, codified, or structured knowledge but also cannot be stored in relational databases. Tacit knowledge can
be divided further into collective and overlapping/specific knowledge. Tacit knowledge,
as opposed to articulated knowledge, is informal. It is shared via person-to-person exchanges and spreads locally and across networks of people interested in similar topics26
(see Figure 2-127; Jacobs 2018). Indeed, Stack Overflow relies on the ability of users to
translate implicit knowledge into explicit knowledge by allowing answerers to articulate
solutions that can be stored and later evaluated by the community.
25 Unstructured and actionable knowledge are subsets of tacit knowledge.
26 The interest graph is a digital representation of a given user’s interests. Stack Overflow, as opposed to a social networking site, connects people via interest-links.
27 Source: https://Upload.WikiMedia.org/wikipedia/commons/2/23/Interest_graph_vs_social_graph.png. Labeled for reuse.
Figure 2-1. Interest graph.
Knowledge diffusion relates to knowledge shareability and is positively correlated with codifiability. For instance, confidential information tends to be undiffused.
Knowledge diversity, however, is uncorrelated with knowledge shareability
(Slaughter and Kirsch 2006). Even when some tacit knowledge cannot be codified and becomes ambiguous, it can still be shared both internally and externally through actions and practices (Ihrig and MacMillan 2015; Szulanski, Ringov and Jensen 2016). Other categories of knowledge are know-why,28 know-how,29 know-who,30 and know-what31
(Roberts 2000; Santhanam, Seligman and Kang 2007).
28 Causal and contextual.
29 Procedural.
30 Selective social relations.
31 Declarative and factual.
Information systems projects are “collective, collaborative, complex, knowledge-
intensive, and creative efforts” (Alawneh 2016, 1). Information technology (IT) is a
subset of information systems (IS). Even though “one of the most important factors contributing to [IT] success is communication and documentation” (DeSimone et al.
1995, 17), one can argue that knowledge repositories can grow so quickly that they render themselves obsolete. Having a predictive framework that retrieves a good answer in a timely manner could be extremely useful to Stack Overflow’s effectiveness and outreach.
2.3 Digging into Stack Overflow
Developers are usually redirected to Stack Overflow from Google’s top search
results, given that Stack Overflow is regarded as a “code-centric knowledge base” (Yang
2016, 225). Stack Overflow also facilitates the process of sharing experiences and expertise in the area of computer engineering.
Stack Overflow is a crowdsourcing platform for knowledge transfer among software developers. Programmers use Stack Overflow as a means to gain knowledge specifically related to “programming languages, API use, configuration management, web frameworks, and web browsers” (Abdalkareem 2017, 55); it allows the community to document bugs and ongoing issues. Some recommendations to improve effectiveness within the site are to “provide direct feedback, assess code snippet quality, and link changes to discussions” (Abdalkareem 2017, 57). Oliveira (2018) proposed connecting users as part of the community and promoting bond-based relationships.
Vasilescu (2013) claimed participation in Stack Overflow does not interrupt
developers’ working rhythm; nor does it slow productivity. Highly productive GitHub
committers tend to adopt a more professorial attitude toward Stack Overflow’s forums
and transfer their knowledge in a more egalitarian way via micro-, intermediate-, and
macro-analysis. Vasilescu’s paper informed this research in terms of linking Stack
Overflow to other platforms, but it did not address the quality of answers posted or the
relation of this quality to effective knowledge transfer.
Previous research has been conducted in Stack Overflow to improve the
maintainability of the site and optimize its responsiveness. Zhang et al. (2015) outlined
how duplicate questions make site maintenance harder, and their solution—DupPredictor32—outperformed Stack Overflow’s search engine by 40.63% (Zhang et al. 2015).
Chua (2015) used metadata33 and content34 for a predictive framework to determine how a high number of downvotes, a low number of tags, and a small number of characters within a question would increase the question’s likelihood of being answered. Yang (2016) analyzed code snippets in Python, JavaScript, Java, and C-sharp to determine answers’ correctness and completeness.
32 DupPredictor uses Latent Dirichlet Allocation (LDA) to extract topic distributions; Porter stemming to reduce words to their root form; and WVTool to remove stop words.
33 Popularity, participation, asking time, derived role, and derived popularity.
34 Level of detail, specificity, accuracy, clarity, and socio-emotional value.
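Footnote 32 lists the building blocks of DupPredictor’s preprocessing: stemming, stop-word removal, and LDA-based topic comparison. The standard-library sketch below illustrates the same general idea on toy question titles; the stop-word list, the crude suffix-stripping stemmer (a stand-in for Porter stemming), and the bag-of-words cosine similarity (a stand-in for comparing LDA topic distributions) are all simplifications, not DupPredictor’s actual implementation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (not the WVTool list).
STOP_WORDS = {"a", "an", "the", "is", "to", "in", "of", "and", "how", "do", "i", "with"}

def crude_stem(word):
    # Very rough stand-in for Porter stemming: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Tokenize, lowercase, drop stop words, and stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def cosine_similarity(a, b):
    # Bag-of-words cosine similarity between two preprocessed texts.
    va, vb = Counter(preprocess(a)), Counter(preprocess(b))
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

q1 = "How do I parse a date string in Java?"
q2 = "Parsing date strings in Java"   # paraphrased duplicate of q1
q3 = "Center a div with CSS"          # unrelated question
print(cosine_similarity(q1, q2) > cosine_similarity(q1, q3))  # True
```

Even this naive pipeline ranks the paraphrased duplicate above the unrelated question, which is the core intuition behind duplicate detection.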
Oliveira (2018) has argued that Stack Overflow exhibits a participation imbalance between United States users (individualistic, contributing, active) and Asian users (collectivist, lurking, passive) via a value-sensitive-design (VSD), tripartite methodology. Additionally, Oliveira (2018) has asserted that the values (such as productivity and reputation) showcased in Stack Overflow better align with a Western, individualistic perspective. Stack Overflow can be considered a structural resource, as the site and its users establish a dynamic feedback loop.
Zhang (2018) proposed a technique named RASH that, via lexical and historical analysis of application programming interfaces (APIs), maps APIs with 70% accuracy, accelerating question resolution and saving developers’ time. This is relevant because, when compared to other question types, API-related questions take three days longer to receive answers yet are viewed 200% more often, according to Zhang (2018).
Ninety-two percent of Stack Overflow’s questions are answered, with a wait time averaging 11 minutes for the first answer and 24 days to get a solution or accepted answer (Masudur-Rahman 2018). Masudur-Rahman (2018) created a prediction model to detect the best answer (defined as the most upvoted) within an unanswered question by using five metrics (answer rejection rate, last access delay, topic entropy, reputation, and vote), to which he applied four different methods: lexical (readability), semantic (topic similarity), user behavior, and popularity. However, Masudur-Rahman (2018) did not apply his findings to already-answered questions.
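Of the five metrics, topic entropy is the most readily formalized: the Shannon entropy of a question’s topic distribution. A minimal sketch follows; the toy distributions are invented, and the interpretation in the comments is illustrative rather than taken from Masudur-Rahman’s paper.

```python
import math

def topic_entropy(topic_distribution):
    # Shannon entropy (in bits) of a question's topic distribution.
    # Higher entropy means the question is spread across many topics.
    return -sum(p * math.log(p, 2) for p in topic_distribution if p > 0)

focused = [0.9, 0.05, 0.05]          # question dominated by one topic
diffuse = [0.25, 0.25, 0.25, 0.25]   # question spread evenly across topics

print(topic_entropy(diffuse) > topic_entropy(focused))  # True
```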
2.4 Knowledge Transfer
Knowledge transfer means getting the right information to the right people at the
right time (Cancialosi 2017). There are three epistemological stances considered for
knowledge transfer: cognitivistic,35 connectionistic,36 and autopoietic37 (Joshi, Sarker and
Sarker 2007).
Indeed, knowledge transfer has been a cornerstone for social and technological
advances throughout history. For instance, the printing press was a major improvement for transferring knowledge over oral tradition. Some classical works, such as The
Aeneid,38 The New Testament,39 and The Divine Comedy40 garnered more popularity
simply by having the right language channel. By selecting the right mode of
communication, knowledge transfer becomes easier to manage and ultimately more
effective.
Nonetheless, technology products are different from works of art, as they are
often the outcome of substantial embedded/internalized tacit knowledge. The present
praxis argues against the widely-accepted idea that mostly tacit (as opposed to explicit)
knowledge creates a strategic advantage by defying the economic law of scarcity. Also, it
35 Fixed, stored data.
36 Knowledge transfer resembles interconnected networks.
37 Self-produced, autonomous, co-evolved, unified, and unshareable.
38 Written in Latin.
39 Written in Greek.
40 Written in Italian.
challenges the perception that a technology’s popularity makes it more accessible or
easier to understand.
In Laudon and Laudon (2006), Procter & Gamble (P&G) solves the problem of its annual growth rate slowing from 5% to 2.6% via knowledge management-driven initiatives and information systems. The case study describes how P&G implemented
InnovationNet for bolstering collaboration, enabling better surveys, removing siloes,
implementing agent-based modeling and simulations, and triggering synergy across
teams. The authors observed that software like MatrixOne was used to automate and
standardize knowledge components.
Most large companies understand the difficulty of transferring knowledge from
experts to novices. Employees may feel overwhelmed by the amount of information they
are expected to master and discouraged by the lack of formal processes to guide them.
Changchit (2003) proposed an intelligent and interactive system that measures perceived
usefulness and ease of use to achieve such a goal. Moreover, good knowledge transfer practices and the unconscious learning of managerial systems, norms, and values take employees to an expert level via pattern recognition (Swap et al. 2001). Nonetheless,
formal knowledge transfer processes can stifle disruptive innovation models.
Some solutions have been proposed regarding knowledge transfer optimization; however, none of them deal with the issue of documentation removal, peer-review, or updating. Alawneh (2016) proposed a framework for dealing with experts’ diversity when preparing a memory repository. Geiger (1994) suggested an evaluation technique
for tacit knowledge. Dillon, Graham, and Aidells (1972) analyzed delivery methods41 for
both individuals and groups. DeMeyer (1991), Roberts (2000), and Sarker et al. (2005)
elaborated on the ability of information and communication technologies (ICTs) to
enhance the transferability of knowledge, even though they acknowledged the relevance
of face-to-face interactions vis-à-vis relational proximity or localness of knowledge.
Iyengar, Sweeney, and Montealegre (2015) used a MIMIC (multiple indicators and multiple causes) model, which applies statistical validation techniques for a set of formative indicators measuring internal IT use, knowledge transfer effectiveness, absorptive capacity, and financial growth. Slaughter and Kirsch (2006) analyzed the impact of knowledge transfer on software process improvement (SPI) initiatives in terms of composition (types of mechanisms) and intensity (how frequently mechanisms are used), converging into accumulative techniques.
Cancialosi (2017) recommended six steps42 for achieving effective knowledge
transfer. Guo (2016) developed a joint learning framework that combines transfer43 and
active44 learning, considering the different source and target domains for sample labeling
of seed parameters, cross-class knowledge transfer, and statistical classifiers. Zero-shot
learning uses models without labeled data45 (Guo 2016). Rohrback (2011) pointed out
41 Video and practice.
42 Make it formal (1), create duplication (2), train, train, train (3), use systems (4), create opportunities (5), and be smart when using consultants (6).
43 System-driven.
44 Human-driven.
45 Unlabeled data can lead to information loss.
that traditional multi-class classification and error-prone zero-shot learning methods tend
to use a limited number of object classes.
Akshatha (2017) studied knowledge transfer’s impact on job satisfaction and
employee turnover. However, females, top management, and employees with five or
more years of experience in their roles are underrepresented in the study. According to
the United States Department of Labor, workers between the ages of 55 and 64 have three
times the tenure46 and job security that workers between the ages of 25 and 34 have
(Bureau of Labor Statistics, United States Department of Labor 2012). Furthermore, the
lack of educational background in non-professional jobs tends to influence tenure
negatively. Computer and engineering occupations outperform most others in terms of tenure, except for careers with the federal government.
The following research papers have potential applicability to the field of knowledge management and highlight important techniques when transferring knowledge in online forums. For example, He (2018) and Nowlan (2007) used intelligent agents47 for faster decision-making via collaborative filtering48 as a solution to data sparsity.49
Wang (2010) explored semantic-oriented knowledge transfer for review rating.50
Shahbandi (2018) optimized error-prone mapping algorithms in the field of robotics.
46 Amount of time a person has worked for a given employer.
47 Human proxies containing a brain, body, society, and human-agent interaction.
48 Collaborative filtering, a technique applied by recommender systems, can be narrow or general.
49 This concept is very common in natural language processing and relates to the problem of not observing enough corpus data to accurately model a given language. It is also known as data sparseness or paucity.
50 Transfer rating.
Jiang (2018) recommended a novel algorithm for auxiliary textual datasets called transfer
spectral clustering (TSC). The research outlined above deals with the optimization of
text-based knowledge, and a few of the techniques presented, such as collaborative
filtering, were partially used for this paper’s data analysis.
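As an illustration of the collaborative-filtering technique partially reused in this paper’s data analysis, the following standard-library sketch recommends a tag to a user from similarity-weighted votes of other users; the user-tag vote matrix and all names are invented, not Stack Overflow data.

```python
from math import sqrt

# Invented "user upvoted tag" matrix for illustration.
votes = {
    "alice": {"python": 5, "pandas": 3, "sql": 1},
    "bob":   {"python": 4, "pandas": 4},
    "carol": {"css": 5, "html": 4},
}

def similarity(u, v):
    # Cosine similarity over the tags both users have voted on.
    shared = set(votes[u]) & set(votes[v])
    dot = sum(votes[u][t] * votes[v][t] for t in shared)
    nu = sqrt(sum(x * x for x in votes[u].values()))
    nv = sqrt(sum(x * x for x in votes[v].values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user):
    # Score tags the user has not voted on by similarity-weighted votes
    # of the other users, then return the highest-scoring tag.
    scores = {}
    for other in votes:
        if other == user:
            continue
        w = similarity(user, other)
        for tag, r in votes[other].items():
            if tag not in votes[user]:
                scores[tag] = scores.get(tag, 0.0) + w * r
    return max(scores, key=scores.get) if scores else None

print(recommend("bob"))  # sql
```

Here "bob" is recommended "sql" because his votes most resemble "alice," the only user who has voted on that tag; this similarity-weighted aggregation is the essence of user-based collaborative filtering.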
On top of optimization, one can see how trust, reputation, and culture all have a
considerable influence on knowledge management. “Knowledge tends to move
horizontally and vertically along structural lines” (Slaughter and Kirsch 2006, 305), and it drives IS profitability because knowledge resources are scarce and hard to purchase (Zhang 2007). As recommended by Gist (1989) three decades ago, behavioral modeling51 can be beneficial for knowledge transfer.
Simon et al. (1996) compared traditional and non-traditional computer training
techniques in a military setting and considered trainees’ cognitive abilities, motivations,
and training environment. For Simon et al. (1996), knowledge was a process and not a
product. Amiryany (2014) researched knowledge-based acquisition failures, especially within high-tech companies. Amiryany (2014) divided them into formal acquisition
structures, communication tools and practices, and on-the-job learning activities. Chang and Gurbaxani (2012) examined the positive impact52 that IT-related software outsourcing has on technical knowledge, economies of specialization, and accumulated knowledge, correlating transfer with intensity, leverage, size, and consistency.

51 Process demonstrating the behaviors driving performance and efficacy.

52 Up to 6% gains.
Nevertheless, companies usually seek outsourcing solutions due to scarce resident capabilities; hence, expected benefits might be overestimated, and tacit/implicit
knowledge transfer could be nonexistent. Knowledge transfer improves due to the external nature of the provider, and wage benefits—not knowledge transfer optimization—drive business decisions. More importantly, companies often do not
consider online forums, such as Stack Overflow, to manage their knowledge repositories
due to security and legal concerns. Stack Overflow can indeed manage private knowledge bases for private and public companies alike.
A segment of the research literature is oriented toward profitability. Ihrig and
MacMillan (2015) introduced a two-dimensional map going from tacit to explicit53 and
from proprietary to widespread.54 The goal of this Cartesian model was to shift data
points toward the top-right corner, with the primary goals being to map knowledge assets,55 assemble multi-functional teams within diverse unit levels, identify new opportunities—such as licensing—and contextualize and re-discover knowledge for new applications.
One of the main barriers to knowledge transfer adoption relates to non-codified
data. It is estimated that 90% of all stored data is unstructured. Furthermore, with the
53 Y axis: going from unstructured to structured.
54 X axis: going from undiffused to diffused.
55 Knowledge assets can be hard/technical or soft/managerial.
ubiquity of digital communication and the prevalence of an aging workforce (Benabou
and Benabo 1999) comes an increase in the number of knowledge repositories, the
maintenance of which becomes harder due to “data explosion and information overload”
(Delen and Al-Hawamdeh 2009). Far from resolving the problem, new technologies— data mining, data warehousing, web crawling, and cheap storage—have made it more pervasive. For instance, Koman and Kundrikova (2016) claimed that technological development triggers constant information production and pointed out that one-fourth of managers are confused by the concept of big data. Good knowledge management—core competencies, areas of expertise, intellectual property, and deep pools of talent (Ihrig and
MacMillan 2015; Khandelwal and Gottschalk 2003)—must avoid the big data trap.
There are several proposed solutions in the literature, but only a fraction of them deal with the issue of massive knowledge repositories. Moreover, there are even fewer solutions that take into account the value of online repositories. As pointed out by Simon
(1971) almost 50 years ago, “A wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it” (Simon 1971, 40-41).
Delen and Al-Hawamdeh (2009) developed a holistic knowledge management framework within a federated system that addresses a lack of emphasis on textual information. Their methodology was to harvest tacit knowledge via interview sessions, mentoring, [memorable, truthful, and positive] story-telling, analogies, and metaphors56
(Swap et al. 2001). The researchers’ sequential, three-step process included extracting
56 Personalization, socialization, and internalization strategies.
data from information sources, consulting know-how materials, and finally generating
actionable knowledge. The drawback of such an approach is that knowledge transfer is
more efficient within an iterative, network-based, and collaborative environment with
feedback effects, not within a step-by-step recipe.
Kim, Song, and Jones (2011) explored a social-cognitive selection framework for
a knowledge acquisition strategy in virtual communities via goal-setting theories from a demand-and-supply perspective. This research has provided abundant information on
acquisition methods within virtual communities; it has not, however, expanded on inter-group knowledge transfer.
Finally, the knowledge engineering paradox states, “[A] more capable individual
is unable to successfully explicate his/her knowledge in forms that can be internalized by
other less capable team members” (Joshi, Sarker and Sarker 2007, 331), especially as the
knowledge gap increases. In fact, high-performers may be detrimental to departmental
knowledge transfer goals. The reason, according to DeSimone et al. (1995), is that expert
innovators work in spurts and reach their highest levels of productivity only rarely. Here,
one encounters a conundrum: Content quality decreases as more people contribute to the
knowledge base (further discussed in the following sections).
2.5 Online Forums
Even when they are hardly mentioned in developer meetings and are infrequently
referenced within internal documentation, online forums are vital to software developers’
successes. Q&A sites and technical forums depend on content quality and communal trust. For instance, Attiaoui (2017) used the belief measure of expertise (BME) for the
detection of authoritative users; authoritative users are a major drawback within online
communities. Specifically, any Stack Overflow user with a reputation greater than 2,400
points is regarded as an expert, without considering the number of accepted answers the user has contributed to the site.
Szulanski, Ringov, and Jensen (2016) claimed that judicious knowledge transfer timing contributes to increased delivery accuracy, regardless of whether a front-loading57
or back-loading58 mode is used. The researchers asserted that knowledge transfer
becomes sticky when it is noteworthy. Although foundational to domain knowledge
transfer, Szulanski’s research mostly ignores the role of online communities.
Santhanam, Seligman, and Kang (2007) researched knowledge transfer processes
and routine work elements during the post-implementation stage. Additionally, they
evaluated network effects with end users and colleagues within and between groups. As
they pointed out, transactive memory59 can be detrimental to proper knowledge transfer.
Joshi, Sarker, and Sarker (2007) explored the impact of knowledge within
development teams that support information systems. They found that capabilities do not tend to play a significant role but that a degree of resonance between senders and receivers is sought. The authors also explored communication theory in terms of messages,
senders, receivers, channels, transmissions, and communication effects. Even though the
57 More affordance for tacit knowledge exchange is allocated to the initiation of the transfer than to its implementation.
58 Less affordance for tacit knowledge exchange is allocated to the initiation of the transfer than to its implementation.
59 When person A uses person B as a memory aid.
article offers advice in the area of knowledge transfer, it mostly ignores the importance of
online forums for IT teams.
According to Sarker et al. (2005, 214), “Not much research has been conducted to
examine knowledge transfer within groups in virtual communities, especially those that
span time and space (i.e., virtual teams).” However, Sarker et al.’s research could benefit
from validation through the use of a comprehensive dataset.
A major problem with online forums is the failure to attract quality answers.
Hence, maximizing answerability and answer quality rates would be beneficial for both
the platform and the community of users it hosts. By using natural language processing
(NLP) for measuring users’ engagement, Kowalik (2016) found that elderly60 users of
Stack Overflow tend to have higher reputational scores; similarly, seniors61 answer slightly more often than juniors but ask half as many questions.
2.6 Summary and Conclusions
After exploring the concepts of information and knowledge, one appreciates their relevance to the area of online forums and specifically to the success of Stack Overflow.
Stack Overflow’s main goal is to transfer knowledge between software developers effectively and improve upon existing techniques used in Q&A sites by applying
60 In terms of age and not technical expertise.
61 In terms of technical expertise and not age.
gamification and other related techniques, such as token economy, for behavior
reinforcement.
Therefore, it is paramount to determine which factors are significant in Stack
Overflow’s posts. Identifying such factors would optimize the site’s responsiveness and
decrease the amount of time users spend finding an answer.
The domain of knowledge management is filled with challenges. Those were considered while reviewing the literature, and some of the shortcomings found are listed below:
1. Who decides who the expert (knowledge originator) is? Most companies look at
data and disregard sources.62 Sources are critical to knowledge acquisition and
can be divided into three categories: dyadic,63 published,64 and grouped65 (Kim,
Song and Jones 2011; Sarker et al. 2005).
2. Some managers adopt big data as a panacea, ignoring that sometimes “less is
more.”
3. Close-mindedness, biases, and overconfidence are underestimated when it comes
to knowledge management.
62 Static or dynamic.
63 One-to-one dialogue between the knowledge recipient and the provider via direct communication.
64 Many-to-many relationships between knowledge providers and recipients that can be helpful to noise-, fidelity-, and credibility-related factors.
65 Open-venue exchange of knowledge among multiple recipients and sources.
4. Users ignore that, besides people, artifacts and organizational entities (such as Stack Overflow) can also transfer knowledge (Alawneh 2016).
5. Knowledge transfer represents more than a back-up plan kicking in when critical
members of the team leave (Cancialosi 2017); it prevents members of the team
from leaving. In other words, an expert’s letter of resignation should never kick off knowledge transfer.
6. There is a lack of video recordings (DeMeyer 1991; Zhang 2007) of code reviews
and lunch and learn sessions.
7. Learning materials tend to be long and convoluted (Jacobs 2018).
8. Prevalence of medium dysphoria, which is when users get confused learning from
similar materials stored in different formats.
9. Knowledge repositories become a disorganized dumping ground and contribute to
organizational waste (Jacobs 2018).
10. Peer-review is commonplace in academia but barely applied to IT departments’
knowledge bases.
11. Competence/reputational trust66 is often regarded as more important than
benevolence/emotional trust (Khandelwal and Gottschalk 2003; Ko 2010; Roberts
2000). Also, managers fail to take advantage of consultants’ expertise and Q&A
forums. Reputation tends to serve as an information filter due to the massive
amount of available information (Sarker et al. 2005).
66 Formal process.
12. Reverse and spontaneous mentoring67 have an impact on organizational
knowledge, work effectiveness, and job success/mobility (Geiger 1994; Quick
MBA 2010; Scandura 1992). Starcevich (1999) recommends a “power-free, two-
way, mutually beneficial relationship” between mentors and protégés.
13. Few companies conduct knowledge audits or appoint a CKO (Chief Knowledge
Officer), Vice President of Knowledge Management, CLO (Chief Learning
Officer), CICO (Chief Intellectual Capital Officer), CINO (Chief Innovation
Officer), or CPO (Chief People Officer) (Babcock 2004).
14. Unwillingness to change, archaic corporate politics, and irrelevance affect knowledge-based programs (Babcock 2004). Hence, knowledge management should be a company-wide concern, not merely an IT problem.
15. Only materials from the same field and industry, usually exclusively available in English, are considered.
The behavior- and culture-driven solutions found in the literature are extensive but do not cover the problem of excessive data availability. DeSimone et al. (1995), Khandelwal and Gottschalk (2003), and Zhang (2007) describe the importance of incentives and reward systems, attitudes toward risk and reward, and hiring and training initiatives, on top of informal, as-needed, spontaneous, and circumstantial (Benabou and Benabo 1999) processes for creating a thriving organizational culture that encourages knowledge transfer. Nonetheless, Bock
67 Protégé-to-mentor.
and Kim (2002) claim that incentives, as opposed to motivations, are counterproductive.
Stack Overflow agrees with this view.
IT developers should never have to seek advice from the nicest or most experienced teammate but from the most knowledgeable one. As it turns out, “Bigger is
[not] more important than better” (Kalpic 2008, 4). It is not information volume (Sarker
et al. 2005) that matters but diversity, veracity, sparsity, and velocity (Ihrig and
MacMillan 2015; Koman and Kundrikova 2016).
The Stack Overflow platform comes with several embedded problems. For example, Ragkhitwetsagul (2017) surveyed Stack Overflow registered users and visitors68 to evaluate outdated code, its legal implications, and its detrimental effects in introducing vulnerabilities. Unfortunately, most programming languages have a short lifecycle, causing many good, reliable articles to be deprecated within a matter of months.
Also, virtually no answerers include software licenses in their code snippets. Sixty-nine percent never validate against licensing69 conflicts. Nine percent of developers copy code on a daily basis. Sixty-four percent actively reuse code from the site and find problems with it but never report it, and nine percent experience legal issues.
In the following chapter, this research will explore novel techniques for collecting data and finding optimal answers within Stack Overflow, at the same time analyzing
68 Unregistered users.
69 In accordance with the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) document.
significant factors that contribute to more efficient knowledge transfer among technical users.
Chapter 3: Methodology
3.1 Introduction
This chapter categorizes Stack Overflow as a data repository and displays the
praxis’ outline in a flowchart that includes the study’s methodology. The chapter goes on
to describe how the data70 was collected and sanitized—a vital step before conducting
text analytics. The last section introduces the uses of text analytics and multiple
regression analysis, specifically partial least squares (PLS) and binary logistic regression
(BLR), as methodological approaches in the praxis. The predictive model will help to
find optimal answers within Stack Overflow and, in so doing, identify significant factors
contributing to more efficient knowledge transfer among software developers.
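As a rough illustration of the BLR component, the following standard-library sketch fits a logistic model by stochastic gradient descent on invented toy features (answer length, presence of a code snippet, answerer reputation); it is a sketch under these assumptions, not the praxis’s actual model or dataset.

```python
import math

# Invented toy rows: [answer length (hundreds of chars),
#                     has code snippet (0/1),
#                     answerer reputation (thousands)]
X = [[1.0, 0.0, 0.1], [3.0, 1.0, 2.0], [0.5, 0.0, 0.05],
     [4.0, 1.0, 5.0], [2.5, 1.0, 1.0], [0.8, 0.0, 0.2]]
y = [0, 1, 0, 1, 1, 0]  # 1 = high-scoring answer (illustrative labels)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, lr=0.1, epochs=2000):
    # Plain stochastic gradient descent on the logistic loss.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

w, b = fit(X, y)
predict = lambda xi: sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5
print([predict(xi) for xi in X])
```

On this linearly separable toy data, the fitted model recovers all six labels; the praxis applies the same family of model to the mined Stack Overflow features.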
Delen and Al-Hawamdeh (2009) described a method for posting an unstructured
question to a generic system in Figure 3-1.71 The steps described are adopted by Stack
Overflow almost in their entirety. Table 3-1 relates Delen’s holistic framework to Stack
Overflow’s behavior. The table serves as a recapitulation of the definitions categorizing
Stack Overflow as a knowledge repository.
70 All referenced data was available in the Stack Exchange database and could be accessed via SQL queries.
71 Source: Dursun Delen, Suliman Al-Hawamdeh. Workflow for posing a query or an unstructured question to the system in A Holistic Framework for Knowledge Discovery and Management (Communications of the ACM 52 (6), 2009), 144. Used with permission.
Figure 3-1. Delen’s holistic framework for knowledge management discovery.
Table 3-1. Mapping Delen’s Holistic Framework into Stack Overflow
Delen’s Holistic Framework             Stack Overflow Equivalent
Start interaction                      User accesses Stack Overflow’s website via a web browser or mobile app. Authentication is optional.
Submit query                           Users try to find an answer to a question.
Search knowledge repository            The Stack Overflow search engine looks for a valid answer.
Answer found / satisfaction level      If an answer is found and it is satisfactory, the user ends the interaction (best-case scenario). If the answer is found but is not satisfactory, the user creates, refines, and submits a new query or question.
Answer not found / satisfaction level  If there is no answer for a posted question, other members of the community step in to offer a resolution. If at least one proposed solution is valid, the user interaction ends; otherwise, the process restarts.
Create knowledge nugget for KD         Each question, answer, comment, or suggested edit becomes a post.
Knowledge Depository (KD)              All posts are stored in the Stack Exchange Data Explorer.
Note. A few steps from Delen’s framework are ignored by Stack Overflow, such as identify human experts, consult with human experts, identify other human experts, and direct user to specialized discussion boards. Indeed, one goal of Stack Overflow is to keep discussions focused and concise and to eliminate unnecessary steps that could delay askers from getting their questions answered. Source: see footnote 71.
Supportability analysis72 (see Figure 3-2), a technique commonly applied to
logistics management to bolster efficiency, was applied in the planning of this paper. The
process is a sequence of eight steps. After evaluating the research requirements (step 1)
and determining the problem and thesis statements (step 2) in accordance with current
research (step 3), SEDE data was retrieved and sanitized (step 4). Upon selecting two
statistical methods (step 5), a new conceptual framework was built (step 6) and tested
(step 7). The analysis finishes with conclusions and recommendations (step 8), urging
other researchers to implement the proposed framework following standard system development procedures.
72 Source: Military Standard 1388-1.
Figure 3-2. Supportability analysis.
3.2 Data Collection and Analysis
The data source for this research is the Stack Exchange Data Explorer (SEDE) repository, which contains a comprehensive data dump of Stack Overflow’s posts. The
SEDE73 was updated in September 2019 and is publicly accessible. It stores almost 400 files and 60 gigabytes of data. The repository contains questions, answers, comments, tags, and user information from Stack Overflow.
The Stack Exchange API is an online, freely available platform with read-only rights that allows researchers to query Stack Overflow’s data repository and retrieve near-real-time information about the site’s activities. Results are limited to 50,000 records per query, probably to prevent denial of service (DoS) attacks and to guarantee the SEDE’s responsiveness.
With thousands of posts entered into Stack Overflow daily, data collection and cleaning was a challenge in this study. Several steps were needed to ensure the sample data was representative of the population. First, posts were categorized via a stratified74 and systematic75 [without replacement] random76 approach. Second, case studies were
used to validate the reliability and accuracy of the results. Such a testing methodology
73 Sources: https://Meta.StackExchange.com/questions/2677/database-schema-documentation-for-the- public-data-dump-and-sede and https://Data.StackExchange.com/stackoverflow/query/835150/list-all- fields-in-all-tables-on-sede.
74 This is a statistical method that breaks the population into subpopulations and then into samples from each subpopulation. It introduces certain risks. The criteria for determining subpopulations might be biased, and convenience sampling might result. Convenience sampling is a non-randomized method that simply looks at the data that is most easily available; e.g., the first ten objects of a set.
75 This is a statistical method that selects samples from a population following ordered framing. For example, in a nine-person population where each person is assigned a unique, sequential identifier, a systematic sampling would be selecting elements three, six, and nine.
76 Random sampling ensures that all elements in the sample are selected by chance, considering that each element had the same probability of being chosen.
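The stratified and systematic sampling described in footnotes 74 through 76 can be illustrated with a short Python sketch. The function names and the month-based strata are illustrative assumptions for exposition; they are not the praxis’ actual tooling.

```python
def systematic_sample(population, k, start=0):
    """Select every k-th element from an ordered sampling frame."""
    return population[start::k]

def stratified_systematic_sample(posts, stratum_key, k):
    """Break the population into strata, then sample systematically within each."""
    strata = {}
    for post in posts:
        strata.setdefault(stratum_key(post), []).append(post)
    sample = []
    for stratum in strata.values():
        sample.extend(systematic_sample(stratum, k))
    return sample

# Footnote 75's example: a nine-person population where elements three,
# six, and nine are selected (1-indexed positions, hence start index 2).
nine_people = list(range(1, 10))
picked = systematic_sample(nine_people, k=3, start=2)

# Hypothetical frame: 120 posts spread evenly over 12 months, sampled
# systematically within each monthly stratum.
posts = [{"id": i, "month": (i % 12) + 1} for i in range(120)]
sample = stratified_systematic_sample(posts, lambda p: p["month"], k=3)
```

Stratifying by month before sampling mirrors the goal stated above: a sample that remains representative of the population across the whole collection period.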
established benchmarks to corroborate the accuracy of the proposed model’s predictive
power.
The critical areas for data collection and the interrelations within the data are
illustrated in Figure 3-3. The proposed model considers question and answer posts’
metadata, user information, comments, feedback, badges, tags, and votes (see Appendix
A).
Figure 3-3. Relationship between critical data fields collected.
The data was transformed using a variation of the Calefato approach (Calefato
2018). The Calefato approach converts variables with a high variance into numerical categories. For example, the literature observes that users’ reputations in Stack Overflow are highly variable—scores range between 1 and over 1,000,000—and prevent even distribution of information across different topics. Through the use of the Calefato method, new users received a value of 1,77 low reputation users a value of 2,78 established
users a 3,79 and expert/trusted users a 4.80
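The reputation binning just described, using the thresholds given in footnotes 77 through 80, can be sketched as follows. The Python function name is illustrative; the praxis itself performed this transformation in Excel.

```python
def reputation_category(score: int) -> int:
    """Map a Stack Overflow reputation score into the four bins used in
    this praxis: 1 = new, 2 = low reputation, 3 = established,
    4 = expert/trusted (see footnotes 77-80)."""
    if score < 10:          # new user
        return 1
    if score < 1_000:       # low reputation: [10, 1,000)
        return 2
    if score <= 20_000:     # established: [1,000, 20,000]
        return 3
    return 4                # expert/trusted: > 20,000
```

Because Stack Overflow reputation is assumed here to be a positive integer greater than or equal to one (footnote 77), no branch for negative scores is needed.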
As part of the data cleaning process, generic questions such as “I keep having
problems” or discussion triggers like “Tell me what your favorite programming language
is” were removed from the dataset. Stack Overflow favors answers that are concrete and
verifiable, disfavoring answers that are vague, open-ended, and conducive to a
fragmented discussion. This does not imply, however, that there is but a single way of
solving concrete problems.
The current praxis uses quantitative methods (i.e., regression analysis) for data
analysis. Additionally, the proposed framework evaluates post quality and determines the
conditions that lead to successful answers—having the highest community score in the
77 Current score < 10. Reputational scores cannot have a negative value as the minimum score will always be one, no matter how many times a user is downvoted. Historical reputation graphs, however, can contain metadata to determine a user’s true score, which can be negative. For simplicity’s sake, this research assumes all reputational scores in Stack Overflow to be positive integers greater than or equal to one.
78 Current score in the range [10, 1,000).
79 Current score in the range [1,000, 20,000].
80 Current score > 20,000.
thread and being selected by the user who posted the question—in Stack Overflow. The results are presented in chapter 4.
3.3 Research Methods
The following sub-tasks are warranted by the need to ensure the validity of the predictive framework that is developed as part of chapter 4:
1. Collect a sample from the SEDE website and ensure that the sample is both
representative and systematic.
2. Compare and contrast answers with a high positive score (upvoted) versus
answers with a negative score (downvoted). This process facilitates the
identification of the factors driving scoring mechanisms within Stack Overflow
and their relationship to high-quality posts.
3. Determine if, as hypothesized, an answerer’s reputation is the driving factor
leading to answer selection. For example, if a question is answered by a member
having a higher-than-average reputation, then the answer has better odds of being
selected.
As previously stated, the following research hypotheses will be evaluated:
RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (has fewer than 2,000 characters), then the answer will have a high relevance score.
RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the
answer will more likely be selected as the asker’s preferred answer, thus becoming easier to find.
RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow’s high-scoring topics will be easier to find (search optimization).
Due to the neutral style of Stack Overflow’s posts, sentiment analysis and affinity tables were not the best methods for evaluating the content of questions, answers, and comments. Another factor that contributed to this choice was the existence of strong vetting, social grooming, and heavy editing against emotionally leaning posts within the platform.
The research product, depicted below in Figure 3-4, justifies the use of the research methods—NLP, PLS, and BLR.
NLP branches out from artificial intelligence with the goal of deciphering human language in a way that is understandable to computer systems (Garbade 2018). NLP is hard to implement due to the nuances of human communication, permeated by slang and sarcasm. Two families of techniques used in NLP are syntactic and semantic analysis.81 Most NLP conducted in this praxis was applied via the ProWritingAid tool.
PLS will be applied to determine whether an answer has a high score or not. PLS
is based on multivariate regression and principal component analysis (Roy 2015). Even
81 Some syntax methods are: lemmatization, morphological segmentation, word segmentation, part-of- speech tagging, parsing, sentence breaking, and stemming (Garbade 2018). Some semantics-based processes are: named entity recognition (NER), word sense disambiguation, and natural language generation (Garbade 2018).
48
though this statistical method is often used when more than one dependent, non-binary variable exists, the nature of answer scores within Stack Overflow made it a good fit for the analysis. In fact, PLS is less likely to cause overfitting than linear regression (Roy 2015), as it simultaneously addresses variability and correlation, which appear to be endemic problems in Stack Overflow. Furthermore, PLS handles data noise competently.
Similarly, BLR is regarded as a statistical classification technique that analyzes a set of independent variables to predict a categorical—usually binary—dependent variable. This justifies its use for predicting when an answer is more likely to be tagged as the preferred solution.
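As a generic illustration of how BLR maps a linear combination of independent variables to a selection probability, consider the Logit link below. This is a hedged Python sketch of the general technique, not the fitted model reported later in the praxis; all names and coefficient values are illustrative.

```python
import math

def logit_probability(intercept, coefficients, x):
    """Binary logistic regression with a Logit link:
    p = 1 / (1 + exp(-(b0 + sum(bi * xi)))), a probability in (0, 1)."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, cutoff=0.5):
    """Predict the binary outcome (1 = answer selected) against a cutoff."""
    return 1 if p > cutoff else 0

# Illustrative call: one predictor with coefficient 3.0 and intercept -1.0.
p = logit_probability(-1.0, [3.0], [1.0])
```

The cutoff turns the modeled probability into the binary prediction (selected or not) that the research product requires.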
Figure 3-4. Research product.
Upon completing the data collection and preliminary cleaning process,82 the methodology to find the best (optimal) answer was divided into four major areas: data preprocessing, text analytics, post metadata, and author metadata.
First, data preprocessing (see Figure 3-5) involved balancing and randomizing the collected post sample for the year 2018 so that the data used was evenly distributed across the entire year. For example, assuming there were a total of 1,000 posts, there could not be 500 posts from January and 500 posts from December; a better distribution would be approximately 80 posts for each month of the year. This process is known as systematic sampling. Additionally, fields having high variability and wide ranges, such as reputation and answer and question scores, were converted into numerical categories.
The categories used were high=1 and normal=0.
82 Process of elimination of blank, incomplete, or expired records.
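The high=1/normal=0 conversion can be sketched in Python as follows. Flagging values above twice the column mean mirrors the Excel transformations reported in chapter 4; the function name and sample values are illustrative.

```python
def binarize(values, multiplier=2.0):
    """Convert a high-variance numeric column into high=1 / normal=0 by
    flagging values above `multiplier` times the column mean."""
    mean = sum(values) / len(values)
    threshold = multiplier * mean
    return [1 if v > threshold else 0 for v in values]

# Illustrative column: one extreme value stands out as "high."
flags = binarize([1, 1, 1, 1, 16])
```

This keeps the statistical methods downstream from being dominated by the extreme ranges observed in fields such as reputation and view counts.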
Figure 3-5. Data preprocessing.
Second, natural language processing (NLP) and text analytics (see Figure 3-6) were applied to text-based columns, such as questions, answers, and comments. The textual content was also analyzed before modification. The number of words, word frequency (word cloud), and the number of characters were calculated for each post’s title and body. Moreover, a capitalization ratio was computed by counting the total number of uppercase characters and dividing it by the total number of characters in each post. For instance, the word “Hello” will have a capitalization ratio of 0.2.83 There are limitations to this metric, especially with SQL-related posts having sample code. SQL commands
83 There are a total of five characters. Only one uppercase character exists: H. There are four lowercase characters: e, l, l, and o. The rate is calculated as follows: 1 / 5 = 0.2.
(statements and functions) are usually written in uppercase, which could offset the
capitalization ratio.
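The capitalization-ratio calculation, and the SQL-related skew just noted, can be demonstrated with a brief Python sketch (illustrative only; the praxis did not implement this metric in Python).

```python
def capitalization_ratio(text: str) -> float:
    """Uppercase characters divided by total characters, as computed per post."""
    if not text:
        return 0.0
    upper = sum(1 for ch in text if ch.isupper())
    return upper / len(text)

# Footnote 83's example: "Hello" has 1 uppercase character out of 5.
ratio = capitalization_ratio("Hello")
```

Note how an all-caps SQL keyword such as SELECT yields a ratio of 1.0, illustrating why posts containing SQL sample code can offset this metric.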
Figure 3-6. Text analytics.
The ProWritingAid v2.0 software tool—grammar checker, style editor, and writing mentor—was manually used to calculate grammar and spelling scores.
Readability scores84 were calculated by looking at the numbers of syllables per word and
words per sentence, thus determining the educational grade-level of the writing; usually,
shorter words and sentences yield better results. Automating this process would have
involved a third-party integration with the text and grammar checking API,85 but that step
84 Readability measures should have a score greater than 60. Glue and transition indexes should be under 40% and over 25% respectively. Some of the metrics used were: Flesch-Kincaid Grade, Coleman-Liau, Automated Readability Index, Dale-Chall Grade, Flesch Reading Ease, and Dale-Chall Ease.
85 Source: https://ProWritingAid.com/en/App/API.
would fall outside the scope of the current praxis. (See Figures 3-7 and 3-8 for additional details regarding the ProWritingAid software.)
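For reference, the Flesch Reading Ease metric cited in footnote 84 combines exactly the two quantities described above, words per sentence and syllables per word. The sketch below assumes the counts have already been tallied; syllable counting itself requires a heuristic that is not shown here.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease: higher scores (the praxis targets > 60) reward
    shorter sentences (fewer words per sentence) and shorter words
    (fewer syllables per word)."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# Illustrative counts: 100 words across 10 sentences, 130 syllables total.
score = flesch_reading_ease(100, 10, 130)
```

With these counts the score is roughly 86.7, comfortably above the 60-point threshold mentioned in footnote 84.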
Figure 3-7. ProWritingAid toolbar in Microsoft Office 365 ProPlus.
Figure 3-8. ProWritingAid summary.
The next step was the removal of all articles (e.g., a, the) and “glue” words (e.g.,
some, much, just) from the text. Once the text was reduced, stemming was applied.
Stemming is a technique that reduces a word to its root form, also known as the stem.
Stemming differs from lemmatization in that the root word obtained might not be an
actual known word in spoken human language. (The programmatic process that
accumulates the bags of words’ scores, using buzzwords and code tags, is described in
chapter 4.)
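The reduction just described can be sketched in Python as below. The stopword list and suffix set are toy stand-ins for illustration, not the actual lists used in the praxis, and the stemmer is deliberately naive.

```python
# Articles and "glue" words (illustrative subset of the lists described above).
STOPWORDS = {"a", "an", "the", "some", "much", "just"}
# Common suffixes to strip, longest first (illustrative).
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")

def naive_stem(word: str) -> str:
    """Strip a common suffix; the result may not be a real word, which is
    precisely what distinguishes stemming from lemmatization."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def reduce_text(text: str):
    """Lowercase, drop stopwords, then stem the remaining tokens."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOPWORDS]
```

For example, "running" stems to "runn", a root that is not a word in spoken English, matching the distinction drawn above.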
Third, post metadata (see Figure 3-9) was analyzed using available statistical techniques, specifically, regression via partial least squares and binary logistic regression.
Post metadata refers to timestamps and metrics from posts. Posts can appear as questions, answers, comments, or suggested edits.
Figure 3-9. Post metadata.
Fourth, author metadata (see Figure 3-10) was examined, as were activity and
reputational scores, geographical location, and the number of profile views at the time of the analysis.
Figure 3-10. Author metadata.
The framework also considered whether one or more solutions had high scores, whether a code snippet was present, and the number of hyperlinks86 within
it. These three metrics, along with the analysis of the four major areas presented above,
will classify the solution as optimal or not.
86 Also known as URLs (uniform resource locators), or locations/addresses of internet resources.
Chapter 4 further elaborates on the nuances of NLP, PLS, and BLR, and introduces examples of how the different metrics were calculated, testing the proposed conceptual framework using three case studies.
Chapter 4: Results
4.1 Introduction
As an important step prior to collecting the data, Stack Overflow’s Annual Developer Survey results for 2018 were reviewed. The survey was conducted by Stack Overflow in January 2018 and included over 100,000 participants from 183 countries.87
The main takeaways from the survey were a wide adoption of DevOps, machine learning, and artificial intelligence techniques and their ethical implications, in addition to contrasting individual perspectives and gender-determined88 goals within the Stack Overflow community. Most users reside in the United States, Western Europe, and India, and the overwhelming majority of developers code as a hobby. Furthermore, more than 80% of developers rely on Stack Overflow when learning new programming strategies and techniques, a figure that should be considered with caution, given that the survey’s participants are drawn from Stack Overflow’s own user base. For that reason, survey data was not used as part of the statistical analysis.
87 Source: https://Insights.StackOverflow.com/survey/2018#overview.
88 Less than 10% of users at Stack Overflow are female, but these stats are slowly shifting.
Figure 4-1. Stack Overflow users’ age, educational level, and wake-up time.
As depicted in Figure 4-1, most surveyed users were between 25 and 45 years old, had a bachelor’s degree, and started working between 7:00 am and 9:00 am.89 This information was useful in determining the frequency of contributions and post quality on the forum and in noticing how most developers were below 46 years of age.90
Furthermore, Figure 4-2 shows a list of widely-used methodologies, platforms, databases, frameworks, and programming languages. This data was vital for selecting and
89 These times were adjusted to reflect the local time of the users answering the survey.
90 In most professions, tenure is more prevalent among employees who are between 55 and 64 years of age.
developing three appropriate case studies that could validate the proposed research
product’s predictive power.
Figure 4-2. Most popular technologies in Stack Overflow.
4.2 Data Collection and Preprocessing
After the main trends from the year 2018 were determined, it was easier to proceed with the process of data collection, cleaning, and transformation.
The Stack Exchange Data Explorer contains a query editor91 that retrieves post
metadata from Stack Overflow. The query used appears below and shows that the main
database tables used were Posts, Users, Comments, Suggested Edits, Post Tags, and Tags.
91 Source: https://Data.StackExchange.com/stackoverflow/query/new.
As previously mentioned, the data was limited to the year 2018, and the answers’ scores had to be less than -2 or greater than 5. All blank/NULL or incomplete records were removed from the data prior to analysis. The query, output, and execution plan from
SEDE are part of Figures 4-3, 4-4, and 4-5 respectively.
SELECT DISTINCT
    CASE WHEN p1.[AcceptedAnswerId] > 0 THEN '1' ELSE '0' END AS 'Answer_Accepted'
    , p.[Score] AS 'Answer_Score'
    , p.[CreationDate] AS 'Answer_Creation_Date'
    , p1.[Score] AS 'Question_Score'
    , p1.[CommentCount] AS 'Question_Number_Of_Comments'
    , p1.[ViewCount] AS 'Question_Number_Of_Views'
    , p1.[Tags] AS 'Question_Tags'
    , p.[Body] AS 'Answer_Text'
    , p.[CommentCount] AS 'Answer_Number_Of_Comments'
    , u.[Reputation] AS 'Author_Reputation'
    , u.[Views] AS 'Author_Views'
    , SUM(t.[Count]) AS 'Tag_Sum'
    , ISNULL(SUM(c.[Score]), 0) AS 'Comment_Sum'
    , ISNULL(COUNT(se.[Id]), 0) AS 'Suggested_Edits_Count'
FROM [Posts] p
INNER JOIN [Posts] p1 ON p.[ParentId] = p1.[Id]
INNER JOIN [Users] u ON p.[OwnerUserId] = u.[Id]
LEFT JOIN [Comments] c ON p.[Id] = c.[PostId]
LEFT JOIN [SuggestedEdits] se ON p.[Id] = se.[PostId]
INNER JOIN [PostTags] pt ON pt.[PostId] = p1.[Id]
INNER JOIN [Tags] t ON t.[Id] = pt.[TagId]
WHERE p.[PostTypeId] = 2
  AND YEAR(p.[CreationDate]) = 2018
  AND YEAR(p1.[CreationDate]) = 2018
  AND (p.[Score] > 5 OR p.[Score] < -2)
GROUP BY p.[Id], p.[CreationDate], p1.[Score], p1.[CommentCount]
    , p1.[ViewCount], p1.[Tags], p1.[AcceptedAnswerId], p.[Score]
    , p.[Body], p.[CommentCount], u.[Reputation], u.[Views]
HAVING COUNT(se.[Id]) > 0 AND SUM(c.[Score]) > 0
ORDER BY p.[Score] DESC;
Figure 4-3. Query used for data collection in SEDE.
Figure 4-4. SEDE’s partial results.
Figure 4-5. Query’s partial execution plan.
Table 4-1 was generated in Microsoft Excel for Office 365 v16.0 32-bit. Fields with high variability (e.g., answer score and question number of views) were converted into binary variables using a variation of the Calefato approach (Calefato
2018), described in the third chapter (methodology). All operations were completed on a
Dell XPS 8300 desktop computer with Windows 7 Professional, Service Pack 1, 16 gigabytes of RAM, and an Intel(R) Core(TM) i7-2600 3.40 gigahertz processor.
Table 4-1. Sample data for a single post

Columns / Metadata                           Original Data             Transformed Data  Microsoft Excel Formula
Answer_Accepted                              1                         1                 -
Answer_Score                                 3,264                     1                 =IF(CELL > AVERAGE(Answer_Score) * 2, 1, 0)
Answer_Creation_Date                         1/15/2018 8:35:15 pm      3                 =WEEKNUM(CELL) (date stored as MM/DD/YYYY)
Question_Score                               2,421                     1                 =IF(CELL > AVERAGE(Question_Score) * 2, 1, 0)
Question_Number_Of_Comments                  17                        1                 =IF(CELL > AVERAGE(Question_Number_Of_Comments) * 2, 1, 0)
Question_Number_Of_Views                     360,255                   1                 =IF(CELL > AVERAGE(Question_Number_Of_Views) * 2, 1, 0)
Question_Tags (renamed Bags_Of_Words_Score)  -6>
Answer_Text (renamed as Answer_Text_Length)  If you take advantage of  1                 =IF(LEN(CELL) < 2000, 1, 0)
Answer_Number_Of_Comments                    19                        1                 =IF(CELL > AVERAGE(Answer_Number_Of_Comments) * 2, 1, 0)
Author_Reputation                            87,398                    1                 =IF(CELL > AVERAGE(Author_Reputation) * 2, 1, 0)
Author_Views                                 21,365                    1                 =IF(CELL > AVERAGE(Author_Views) * 2, 1, 0)
Tag_Sum                                      107,472,957               1                 =IF(CELL > AVERAGE(Tag_Sum) * 2, 1, 0)
Comment_Sum                                  870                       1                 =IF(CELL > AVERAGE(Comment_Sum) * 2, 1, 0)
Suggested_Edits_Count                        114                       1                 =IF(CELL > AVERAGE(Suggested_Edits_Count) * 2, 1, 0)

Note. The CELL keyword varies depending on the value to be transformed. Averages are calculated for the entire referenced column.

Due to the high number of line breaks, commas, single quotes, double quotes, and special characters within the answer’s body content column, an exhaustive data cleaning process was conducted. One of the Excel formulas used is shown here: =CLEAN(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(H2, CHAR(13), ""), CHAR(10), ";"), ",", ";"))). Not only did these operations make it easier to analyze the data, but they also linearized all textual content, making it compatible with the comma-separated-values (CSV) file format. The encoding applied was UTF-8.92 Moreover, code syntax, such as C++, which interfered with the compilation process, was substituted with an acceptable value (e.g., C-plus-plus).

92 This character encoding method is a variation of the Unicode Transformation Format.

The calculation of BOW scores was implemented in Microsoft Visual Studio Professional 2019 v16.3 with the .NET Framework v4.8. The basic functionality, summarized in Figure 4-6, required the creation of two CSV files: one containing the original raw data and a second for storing the values compiled by the utility C-sharp program.
A total of 1,485 lines were analyzed, excluding the header row, which was omitted from the analysis. Each line was expected to have 14 elements—one for each column displayed in Table 4-1. Only the elements below, however, were used for compiling BOW scores:

Question_Tags (position 6,93 element 7)
Answer_Text (position 7, element 8)
Tag_Sum (position 11, element 12)

Using regular expressions and ignoring case sensitivity, five points were added for every matching tag within the body of an answer; if a code tag was present, ten points were added. (Other data validation techniques and error handling implemented in the program are not included as part of Figure 4-6.)

93 In C-sharp, element positions start at zero. For instance, the last position of an array with 10 elements would be 9.

string[] allLines = File.ReadAllLines(_rawDataFile, Encoding.UTF8);
foreach (string line in allLines)
{
    lineNumber++;
    // Skip the header row.
    if (lineNumber == 1) { continue; }

    string[] fields = line.Split(',');
    if (fields.Length != 14)
    {
        File.AppendAllText(_textScoresFile, "0" + "\n");
        continue;
    }

    string questionTagList = fields[6];
    string answerBody = fields[7];
    string tagSum = fields[11];

    string[] tagList = questionTagList.Split('>');
    int count = 0;
    foreach (string tag in tagList)
    {
        string cleanTag = tag;
        cleanTag = cleanTag.Replace(">", string.Empty);
        cleanTag = cleanTag.Replace("<", string.Empty);
        if (!string.IsNullOrEmpty(cleanTag))
        {
            // Five points per tag match within the answer's body.
            foreach (Match match in Regex.Matches(answerBody, cleanTag, RegexOptions.IgnoreCase))
            {
                count = count + 5;
            }
        }
    }

    // The listing is truncated in the source at this point; per the text, the
    // remainder presumably added ten points when a code tag was present and
    // appended the accumulated score to the output file:
    if (answerBody.ToLower().Contains("<code>")) { count = count + 10; }
    File.AppendAllText(_textScoresFile, count + "\n");
}

Figure 4-6. C-sharp functionality for calculating BOW scores.

The link between the data retrieved from the SEDE website and the quantitative methods applied is shown below in Figure 4-7 and further developed in the following section. Only SEDE data (and not survey information) was used to perform the statistical analysis.

Figure 4-7. Linking data source and quantitative methods.
4.3 Predictive Models

In the current section, the following two research hypotheses will be evaluated:

RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (has fewer than 2,000 characters), then the answer will have a high relevance score.

RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the answer will more likely be selected as the asker’s preferred answer, thus becoming easier to find.

The dataset was thoroughly evaluated to determine whether it was evenly distributed throughout the year 2018, a process known as systematic sampling. The total sample size consisted of 1,485 records (n = 1,485; see Figure 4-8). The box-and-whisker and histogram plots shown below indicate the data was adequate for further analysis. Post frequency is consistent from month to month, except for a steady decline toward the end of the year. This can be explained by two main factors: many employees use accrued vacation time that is set to expire at the end of the year, and most of the year’s holidays fall in November and December (i.e., Thanksgiving Eve, Thanksgiving Day, Christmas Eve, Christmas Day, and New Year’s Eve).

Figure 4-8. Post frequency distribution per week.

The research questions posed in chapter one also served as the basis for the statistical analysis. As expected, the average score of accepted answers was 55.53% higher than the average score of rejected ones.94 The standard deviation calculated for overall author reputation was 87,645,95 which shows how much variability is prevalent within Stack Overflow in terms of technical expertise and user engagement. Similarly, almost three-quarters of answers96 contained at least one tag or buzzword from the question.
The average number of comments per answer was 70.07% higher than the average number of comments per question.97 The range of suggested edits per chosen answer was 1 to 340,98 which hints at varying levels of post quality.

Quantitative analysis (multivariate regression), specifically partial least squares with a confidence level of 95% (α = 0.05), was conducted in Minitab v18.1. The total number of principal components—not to be confused with significant variables or factors—specified was five.

94 For accepted answers, the formula used was =AVERAGEIF(Answer_Accepted, 1, Answer_Score), which returned an average score of 39.38. For rejected answers, the formula used was =AVERAGEIF(Answer_Accepted, 0, Answer_Score), which returned an average score of 25.32.
95 Excel formula: =STDEVA(Author_Reputation).
96 Excel formula: =1,033 / COUNTA(Question_Tags). The numerator value of 1,033 was compiled in C-sharp using an accumulator method that added one every time an answer’s body contained at least one of the question’s tags; each post could only be added once. The resulting pairing or matching rate was 69.56%.
97 For answers, the formula used was =AVERAGE(Answer_Number_Of_Comments), which returned an average of 4.83. For questions, the formula used was =AVERAGE(Question_Number_Of_Comments), which returned an average of 2.84.
98 The range’s floor or minimum value was calculated in Excel as {=MIN(IF(Answer_Accepted = 1, Suggested_Edits_Count))}; the ceiling or maximum value was calculated as {=MAX(IF(Answer_Accepted = 1, Suggested_Edits_Count))}.

A few researchers have used PLS for dealing with binary, dependent response variables. Some of the scenarios considered were:

Rodríguez-Pérez (2018) demonstrated that PLS does not perform well when analyzing factors with high dimensionality or when applied to datasets containing few records.
Yin (2018) and Cao (2018) used discrete, modified binary particle swarm optimization (BPSO) algorithms via PLS and adaptive models, such as remote sensing.99

Within a hurdle model,100 Zhang (2018) explored how a binomial framework guided the dependent variables’ binary outcomes, which can be zero or positive.

Sun (2019) predicted simulated phenotypical traits of different populations using binary indicators in the response variable, which outperformed FST101 and EigenGWAS102 methods.

Conversely, Murase (2018) applied PLS to material discrimination in the near-infrared (NIR) band, stored as binary classifiers.

The first research hypothesis tries to predict when an answer might have a high relevance score. Nonetheless, the data collected was unbalanced because only 20.54% of answers (305 out of 1,485) had a high score. Random sampling was applied to the dataset to ensure that the percentage of high-scoring answers approached 50%.103 After randomization, the new sample had 305 high-scoring and 294 regular answers (n = 599, 50.92%). The best predictive accuracy obtained using all independent variables approached 70%. To prevent overfitting, it was then decided to choose significant factors by looking into standard coefficient and loading plots (see Figures 4-9 and 4-10).

99 Remote sensing is a reliable tool for measuring [inland] water quality (Cao 2018).
100 A hurdle model is based on binary or before-and-after thresholds.
101 A fixation index is a score of population differentiation that evaluates genetic structures (Sun 2019).
102 Combination of genome-wide association studies and eigenvector decomposition (Sun 2019).
Upon careful examination, the following six variables104 were selected, and the model was re-run with three components, getting a prediction accuracy of 73.96%: Question_Score Question_Number_Of_Views Author_Views Comment_Sum Suggested_Edits_Count Answer_Text_Length 103 The Excel formula used for randomizing and balancing the data was: =IF(CELL = 1, RANDBETWEEN(1, 80), RANDBETWEEN(81, 100)). 104 The first three independent variables were pulled from the coefficient and loading plots, whereas the remaining three were included from the first research hypothesis. 72 PLS Coefficient Plot (response is Answer_Score) 5 components 0.3 0.2 0.1 Coefficients 0.0 -0.1 1 2 3 4 5 6 7 8 9 10 11 12 Predictors Figure 4-9. PLS standard coefficient plot for answer score. PLS Loading Plot 0.4 Question_Number_Of_Views Answer_Text_Length 0.3 Question_Score 0.2 Answer_Accepted 0.1 Comment_Sum 0.0 -0.1 Author_Views Question_Number_Of_Comments Component 2 Component -0.2 Author_Reputation Answer_Number_Of_Comments -0.3 Bags_Of_Words_Score -0.4 -0.5 Tag_Sum Suggested_Edits_Count 0.0 0.1 0.2 0.3 0.4 0.5 Component 1 Figure 4-10. PLS loading plot for answer score. 73 The response plot is presented below in Figure 4-11. The resulting statistics are also included. Large residuals in the Y-axis are common in poor models, whereas X- residuals can identify outliers (Roy 2015). PLS Response Plot (response is Answer_Score) 3 components 1.2 1.0 0.8 0.6 0.4 Calculated Response 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Actual Response Figure 4-11. PLS response plot for answer score. Table 4-2. Analysis of Variance Source DF SS MS F P Regression 3 35.886 11.9620 65.24 0.000 Residual Error 595 113.814 0.1913 Total 598 149.699 74 Table 4-3. Model Selection and Validation Components X Variance Error R-squared 1 0.366245 114.172 0.237327 2 0.544771 113.818 0.239690 3 0.662258 113.814 0.239720 Table 4-4. 
Coefficients

Term                      Answer_Score  Standardized Answer_Score
Constant                   0.323585      0.000000
Question_Score             0.255689      0.205350
Question_Number_Of_Views   0.226486      0.181333
Author_Views              -0.078108     -0.042821
Comment_Sum                0.207882      0.157511
Suggested_Edits_Count      0.134519      0.101144
Answer_Text_Length         0.043785      0.033549

Minitab's fit coefficients were used to determine each predicted value via a variable cutoff threshold. Predicted values were then compared to actual values, and an accuracy score was compiled: out of 599 records, 443 were correctly predicted, for a 73.96% accuracy score. For easier testing, the minimum and maximum fit coefficients were retrieved, and a histogram of the fit coefficients was built. This contributed to a better understanding of the dynamics and behaviors guiding the prediction model.

The results seem to confirm the first hypothesis, given that the answer's comment and character counts, along with suggested edits and related factors such as question score and author reputation, were found to be significant. Unsurprisingly, question scores and the number of views have a positive influence on good answers. Good, relevant questions are in higher demand and receive more traffic than poor, irrelevant ones; hence, a greater number of users can upvote the solutions within the thread. An interesting finding is that the author's profile views correlate negatively with answer scores. A plausible explanation is that, when trying to offer feedback on an erroneous answer, knowledgeable users might visit the author's page to give their criticism in a direct yet private fashion.

While the first research hypothesis predicts high-scoring answers (the community's best answer), the second hypothesis seeks to predict whether an answer is chosen by the individual user who posed the question.
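The cutoff-based scoring step described above can be sketched in Python using the fitted coefficients from Table 4-4. The predictor values and the 0.5 cutoff below are illustrative assumptions; in the study, the inputs were transformed before fitting and the cutoff was tuned against the histogram of fit coefficients.

```python
# Fitted PLS coefficients from Table 4-4 (unstandardized column).
COEF = {
    "Constant": 0.323585,
    "Question_Score": 0.255689,
    "Question_Number_Of_Views": 0.226486,
    "Author_Views": -0.078108,
    "Comment_Sum": 0.207882,
    "Suggested_Edits_Count": 0.134519,
    "Answer_Text_Length": 0.043785,
}

def fitted_value(row):
    """Linear PLS fit: intercept plus coefficient-weighted predictors."""
    return COEF["Constant"] + sum(COEF[k] * v for k, v in row.items())

def accuracy(rows, labels, cutoff=0.5):
    """Classify a fit as high-scoring when it exceeds the cutoff,
    then compare predictions against the actual labels."""
    predicted = [fitted_value(r) > cutoff for r in rows]
    correct = sum(p == bool(a) for p, a in zip(predicted, labels))
    return correct / len(labels)

# Two hypothetical (already transformed) records: a strong and a weak answer.
rows = [
    {"Question_Score": 0.8, "Question_Number_Of_Views": 0.8, "Author_Views": 0.1,
     "Comment_Sum": 0.8, "Suggested_Edits_Count": 0.8, "Answer_Text_Length": 0.8},
    {"Question_Score": 0.0, "Question_Number_Of_Views": 0.0, "Author_Views": 0.0,
     "Comment_Sum": 0.0, "Suggested_Edits_Count": 0.0, "Answer_Text_Length": 0.0},
]
```

Applied to the study's 599 balanced records, this comparison of predicted against actual labels is what produced the 443-correct (73.96%) figure reported above.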
In this case, due to a lack of cross-validation controls and to the myriad subjective factors guiding individual user behavior, one may anticipate these results to be less reliable than those obtained for the first hypothesis. This caveat is especially relevant if the prediction accuracy is greater than the one obtained for the first hypothesis.

Once again, the data for chosen (accepted) answers was skewed: 77.85% of the answers in the dataset were marked as accepted (1,156 posts out of 1,485). Random sampling was conducted, and the resulting dataset contained only 332 chosen answers (n = 639, 51.96%).[105] The statistical method used for the analysis was BLR (i.e., the answer is either selected or not), with Logit as the link function, a two-sided 95% confidence level, and Pearson residuals. All generated plots were standardized. The five significant factors are displayed below:

Answer_Score
Comment_Sum
Suggested_Edits_Count
Bags_Of_Words_Score
Tag_Sum

[105] The Excel formula used was: =IF(CELL = 1, RANDBETWEEN(1, 20), RANDBETWEEN(21, 100)). CELL refers to individual values within the Answer_Accepted column.

Table 4-5. Deviance Table

Source                  DF   Adj. Dev.  Adj. Mean  Chi-Square  P-Value
Regression                5   280.901     56.180      280.90     0.000
Answer_Score              1   205.182    205.182      205.18     0.000
Comment_Sum               1    10.021     10.021       10.02     0.002
Suggested_Edits_Count     1    54.669     54.669       54.67     0.000
Bags_Of_Words_Score       1     2.345      2.345        2.35     0.126
Tag_Sum                   1     2.537      2.537        2.54     0.111
Error                   633   603.963      0.954
Total                   638   884.864

Table 4-6. Model Summary

Deviance R-squared  Deviance R-squared (adjusted)  AIC
31.75%              31.18%                         615.96

Table 4-7. Coefficients

Term                   Coefficient  SE Coefficient  VIF
Constant               -1.14200     0.14100
Answer_Score            2.99300     0.24600        1.28
Comment_Sum            -1.15700     0.36300        1.35
Suggested_Edits_Count   2.41900     0.35800        1.62
Bags_Of_Words_Score    -0.00200     0.00127        1.04
Tag_Sum                -0.52300     0.33100        1.58

Table 4-8.
Goodness-of-Fit Tests

Test             DF    Chi-Square  P-Value
Deviance         633   603.96      0.791
Pearson          633   659.21      0.228
Hosmer-Lemeshow    8    14.97      0.060

The prediction accuracy for chosen answers was 80.91% (517 correct predictions out of 639 total records). Even though chosen answers could be predicted successfully, the bags-of-words and cumulative tag scores were not as significant as the answer score and the suggested number of edits. More research might be needed to explore the significance of bounties and badges in answer selection by the users who pose questions. The regression equation describing the model used to predict answer selection is the following:

P(1) = exp(Y') / (1 + exp(Y'))
Y' = -1.142 + 2.993 Answer_Score - 1.157 Comment_Sum + 2.419 Suggested_Edits_Count - 0.00200 Bags_Of_Words_Score - 0.523 Tag_Sum

With the first two hypotheses confirmed, the next step consists of validating the research product. As described in the following section, the initial two hypotheses serve as the foundation for implementing the case studies.

4.4 Case Studies

In this section, the research product is tested by applying it to three case studies spanning .NET-, API-, and JavaScript-related questions. The third hypothesis is the following:

RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow's high-scoring topics will be easier to find (search optimization).

Figure 4-12 shows the link between the two hypotheses covered in the previous section and the third hypothesis developed in the current one.

Figure 4-12. Relationship model between research hypotheses and case studies.

The first case study consisted of finding a Stack Overflow question related to .NET via the Stack Exchange Data Explorer or the Google search engine.
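Before walking through the case studies, note that the fitted BLR equation above can be evaluated directly. A minimal Python sketch; the inputs are the transformed predictor values used in the fit (not raw Stack Overflow counts), and any sample values passed in are purely illustrative:

```python
from math import exp

def acceptance_probability(answer_score, comment_sum, suggested_edits_count,
                           bags_of_words_score, tag_sum):
    """P(answer accepted) from the fitted BLR equation with Logit link."""
    y = (-1.142
         + 2.993 * answer_score
         - 1.157 * comment_sum
         + 2.419 * suggested_edits_count
         - 0.00200 * bags_of_words_score
         - 0.523 * tag_sum)
    # Inverse logit maps the linear predictor Y' onto a probability in (0, 1).
    return exp(y) / (1 + exp(y))
```

With all predictors at zero, the probability falls below 0.5 (the negative intercept dominates), and it rises monotonically with the answer score, consistent with Answer_Score being the strongest term in the deviance table.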
Hence, the following question was found: do we use the double equal sign (==) or .Equals() for comparing strings in C#?[106] All four areas described in the methodology chapter were applied.

Data preprocessing:
o Converted answer scores into categories (high: > 50; normal: > 10 and <= 50; low: <= 10). Out of 16 answers, 11 were low, 3 were normal, and 2 were high.
o Converted question scores into the same categories. Since a single question was analyzed and its score was 491, it was tagged as high.

Text analytics:
o Compiled the cumulative BOW score (unique tags plus code tags). There were no code tags found within the optimal answer (highest score and asker's choice), but the BOW score was 10 because the equals tag appeared twice. Standard tags (e.g., c#, .net, equals) add five points every time they appear, whereas code tags add 10 points per occurrence.
o Calculated the number of characters per post's body and title. Counting spaces, the question's title had 37 characters, and its body had 393 characters.
o Generated the number of words per post and calculated word frequency. There were 6 words in the title and 55 words in the body.
o Capitalization (uppercase) ratio: out of 235 characters in the answer, only 6 were uppercase: 6 / 235 = 2.55%.
o Removed articles, stop words, and "glue" words, and applied stemming.[107] After using the NLTK (Natural Language Toolkit) in Python, the following words remained: ==, expression, type, System.Object.ReferenceEquals.Equals, virtual, method, override, version, string, comparison, and content (11 words remaining out of 38 original words).
o Grammar/spelling score: using ProWritingAid, the grammar score was determined to be 22/100, while the spelling score was 35/100.

[106] Source: https://StackOverflow.com/questions/814878/c-sharp-difference-between-and-equals.

Post metadata:
o Day of the week: Saturday, 05/02/2009.
o Time of the day (morning, afternoon, evening, etcetera): afternoon, 13:39 or 1:39 pm.
o Number of answers/comments/edits and question/answer/comment scores: there were a total of 16 answers, 45 comments, and 11 recommended edits. The question had a total score of 491, whereas the best answer had a score of 395, and the best comment had a score of 49.

Author metadata:
o Activity score: 8,590 actions.
o Reputation: 351,765.
o Geographical location: California, United States.
o Number of profile views: 51,630.

Other:
o High score: yes.
o Code snippet: no.
o Number of URLs: 2.

[107] Stemming consists of reducing a given word to its root form.

Out of the 18 conditions listed, a total of 16 were met, for an 88.89% combined score. The algorithm determined that the answer selected by the asker and the community was indeed the best/optimal one. The remaining two case studies are presented in tabular form (see Table 4-9).

Table 4-9. Case Studies in API and JavaScript

Conditions / Case Studies    Case Study 2 - API[108]   Case Study 3 - JavaScript[109]
Totals                       12 / 18 = 66.67%          14 / 18 = 77.78%
Answer categories            Low: 3                    High: 2
Question categories          Low: 1                    High: 1
BOW score                    10                        20
Character count              206                       671
Word count                   40                        116
Capitalization ratio         3 / 206 = 1.46%           19 / 671 = 2.83%
Words after using NLTK       16                        23
Grammar and spelling scores  10; 10                    21; 11
Day of the week              Saturday                  Tuesday
Time of the day              Afternoon                 Morning
Comment and edit metadata    1; 0                      10; 0
Author activity score        234                       1,204
Reputation                   343                       18,775
Geographical location        Singapore, Singapore      Klagenfurt, Austria
Number of profile views      252                       1,125
High score                   No                        Yes
Code snippet                 No                        Yes
Number of URLs               2                         2

Note. Underlined cell values mean that other answers within the thread met optimal criteria.

Both cumulative percentages are consistent with the expected results and post quality. For instance, even though both answers were chosen by the asker, the lower scores of the API solution are correctly reflected in its lower total percentage.

[108] Source: https://StackOverflow.com/questions/5595334/paypal-integration.
[109] Source: https://StackOverflow.com/questions/12797118/how-can-i-declare-optional-function-parameters-in-javascript.

As shown by the implementation and analysis of the three case studies, the proposed research product could increase the speed of knowledge transfer within Stack Overflow, which in turn could optimize the platform's search mechanisms.

Chapter 5: Discussion and Conclusions

5.1 Discussion

Stack Overflow meshes (or "plug-and-plays") two important techniques. First, it uses a Q&A format without following a traditional discussion-forum paradigm; second, its signature layout resembles that of wiki-based platforms in that moderators are chosen by the user community and interactions depend on collaborative trust. Pivotal to that communal trust are reputation scores, a metric that becomes relevant because all developers within the site are regarded as peers. Stack Overflow adopts the role of a lingua franca, a bridge connecting developers.

Stack Overflow, however, reveals a major paradox, contrasting the interests and goals of individuals with those of the community. Such interests and goals are not always aligned and, in some instances, are mutually exclusive. For instance, since only one person can mark an answer as accepted, threads can go longer without an answer being marked as resolved.

5.2 Conclusions

All three research hypotheses were confirmed. First, PLS analysis was able to predict high-scoring answers with 73.96% accuracy. Second, chosen answers could be correctly predicted 80.91% of the time via BLR. Third, the three case studies completed in the previous chapter showed encouraging results and opened the possibility for further research and possible automation of the research product. These outcomes could be useful for optimizing search queries[110] for the millions of developers who access Stack Overflow every day.
The leading factors that prevented further improvement of the prediction accuracy were the massive amount of duplicate information within the site, the variability of answerers' expertise on the subject, and responders' competence in the use of the English language. Moreover, many questions are not duplicates but are similar enough to cause confusion among users and negatively affect post quality. Such drawbacks limit Stack Overflow's ability to attract a greater number of high-quality posts.

As shown by Stack Overflow, proper knowledge transfer in the area of software development defies and possibly defeats the law of economic scarcity. A technically savvy, cross-trained team is more productive than a team pervaded by silos of expertise. Listening to the Stack Overflow community could inform business processes, company planning, and even stock valuations that are heavily influenced by emerging programming techniques or breakthroughs. Without a doubt, adequate knowledge transfer must be a vital component of every IT department. When properly applied, knowledge transfer and intellectual capital can become an industry differentiator, a driver of innovation, and a means of achieving competitive advantage (Bock and Kim 2002). Stack Overflow can guide this innovation from the bottom up (i.e., from developers to managers to executives).

[110] This can be accomplished by translating scoring mechanisms into sitemaps and SEO (search engine optimization) within Stack Overflow.

5.3 Contributions to the Body of Knowledge

This praxis contributes to the body of knowledge in two important ways:

1. By introducing the definition of the optimal answer region, the nature and outcome of the statistical analysis are more holistic than previous research conducted in this area.
2.
By building upon existing knowledge management techniques and knowledge transfer methodologies, the proposed research product facilitates finding optimal answers within Stack Overflow, with approximately 80% prediction accuracy.

5.4 Recommendations for Future Research

The amount and quality of the information found on Stack Overflow could inspire future research in several areas. A few recommendations are outlined below:

Implementation and automation of conceptual frameworks related to knowledge transfer, knowledge diffusion, and talent acquisition
Analysis of online forums' dynamics from a sociological perspective

Of primary importance, the conceptual framework developed here could be built into a configurable software tool for increased predictive power in locating high-scoring answers within Stack Overflow. Hence, researchers could expand the methodologies used in this praxis by scaling the dataset and using more up-to-date information. They could also implement a software tool to automate the processes of data collection, cleaning, and analysis.

Second, a case study could be conducted to evaluate the effectiveness of implementing Stack Overflow for Teams[111] in a development setting, especially if the knowledge managed is sensitive, privileged, or confidential. Stack Overflow for Teams is a private, cloud-based, searchable knowledge management platform with a centralized source of truth that seamlessly integrates all capabilities of the Stack Overflow environment into companies. The information is handled reliably and securely for a little over $200 per user per annum, driving internal team collaboration and lowering overhead costs. Furthermore, chat tools like Slack can be easily integrated with Stack Overflow for Teams to track article updates and notify users when new questions and answers become available.
According to claims by Stack Overflow, employees using the platform can save up to 20 hours per month; internal support requests are resolved 20% faster; and, most importantly, solutions are stored for future consultation in a knowledge-base format. In fact, companies like Expensify[112] have successfully implemented Stack Overflow for Teams.

[111] Source: https://StackOverflow.com/teams.
[112] Expensify is a software company offering expense management systems for individuals and companies.

Third, research could be conducted on Stack Overflow at the user level to determine a given developer's areas of expertise and areas requiring additional training. For instance, one could design an m × n knowledge matrix for finding subject-area experts. The m rows would contain the names or aliases of all engineers, and the n columns would store a running score for each software product, interface, or programming language.[113] A Likert scale ranging from zero to five would be used (zero being no knowledge in a given area, and five being the highest attainable expertise). The knowledge matrix could also include contact details and availability, and it would work similarly to yellow pages (Delen and Al-Hawamdeh 2009). A good methodology for analyzing these data could be cluster analysis, which groups data objects into clusters through unsupervised classification, with the objective of finding high intra-class similarity and low inter-class similarity while detecting hidden patterns.

Fourth, an analysis of Stack Overflow's data could be invaluable to Human Resources (HR) departments and hiring/recruiting agencies.
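The knowledge matrix and cluster-analysis idea in the third recommendation can be made concrete with a small Python sketch. The engineers, skills, and Likert scores below are invented for illustration, and the simple nearest-exemplar grouping stands in for a full clustering algorithm such as k-means:

```python
from math import dist

# Hypothetical m x n knowledge matrix: rows are engineers, columns are
# technologies, cells hold Likert scores (0 = no knowledge, 5 = expert).
skills = ["C#", "JavaScript", "SQL"]
matrix = {
    "alice": [5, 1, 4],
    "bob":   [4, 0, 5],
    "carol": [1, 5, 0],
    "dave":  [0, 4, 1],
}

def find_experts(matrix, skill_index, minimum=4):
    """Yellow-pages lookup: engineers at or above a Likert threshold."""
    return sorted(name for name, row in matrix.items()
                  if row[skill_index] >= minimum)

def nearest_exemplar(matrix, exemplars):
    """Group each engineer with the closest exemplar profile (Euclidean
    distance) - the same intuition behind maximizing intra-class and
    minimizing inter-class similarity in cluster analysis."""
    groups = {name: [] for name in exemplars}
    for name, row in matrix.items():
        closest = min(exemplars, key=lambda e: dist(row, matrix[e]))
        groups[closest].append(name)
    return groups
```

With these invented scores, `find_experts(matrix, 0)` surfaces the C# experts, and grouping around the "alice" and "carol" profiles separates the back-end-leaning engineers from the JavaScript-leaning ones.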
Hiring good programmers is an extremely challenging and tedious task, made so by a candidate-driven market in which a high number of open positions (demand) is not matched by the low number of available, reliable developers (supply). By researching the technological and industry trends found within the site, technical recruiters could determine which tools are in greater demand (e.g., most questions from June 2019 relating to the Rust programming language) and find knowledgeable users within a geographical area for immediate screening.[114] HR teams could also use these data to guide their staffing initiatives and fine-tune their onboarding programs to decrease turnover rates. Indeed, with over 50 million developers visiting Stack Overflow every month, the platform represents an excellent recruiting[115] alternative via the Careers 2.0 initiative.

[113] It would be useful to add a dictionary with detailed descriptions for each topic or tool.
[114] This process can also be regarded as recruiting advertising and/or branding.

Fifth, sociological studies could be conducted to evaluate the factors leading to site popularity, user reputation, and community interaction within online forums. Stack Overflow is part of a bigger platform called Stack Exchange, which expands its customer base using the Area 51 approach: approximately two hundred core users in a given discipline are recruited and surveyed with the goal of branching out into a new site targeting specialized users (e.g., chefs or video editors). Essentially, only an average of 10% of all feedback is actionable. Studying the social dynamics within individual sites, and how those sites relate to each other, could help in understanding the information flows that guide how online communities grow and evolve.
Similarly, researchers could determine what kinds of personality traits are prevalent in Stack Overflow, thus assessing the level of diversity within the community and what types of role models users seek (i.e., factors driving peer recognition).

[115] Sources: https://www.StackOverflowBusiness.com/talent/platform, https://StackOverflow.blog/2011/02/23/careers-2-0-launches/, and https://StackOverflow.com/jobs/get-started. Careers 2.0 recommends portfolio-based over résumé-based screening.

References

Abdalkareem, Rabe, Emad Shihab, and Juergen Rilling. 2017. "What Do Developers Use the Crowd For? A Study Using Stack Overflow." IEEE Software 34 (2): 53-60.

Akshatha, S., and S. Senthil Ganesh. 2017. "Exploring the Determinants of Exit Experience: Results from the Survey of Ex-Employees in India." International Conference on Communication and Signal Processing. Coimbatore.

Alawneh, Ali Ahmad, and Rashad Aouf. 2016. "A Proposed Knowledge Management Framework for Boosting the Success of Information Systems Projects." IEEE Engineering and MIS (ICEMIS): 1-5.

Amiryany, Nima, and Jeanne W. Ross. 2014. "Acquisitions that Make your Company Smarter." MIT Sloan Management Review 55 (2): 13. https://SloanReview.Mit.edu/article/acquisitions-that-make-your-company-smarter/.

Attiaoui, Dorra, Arnaud Martin, and Boutheina Ben Yaghlane. 2017. "Belief Measure of Expertise for Experts Detection in Question Answering Communities: Case Study Stack Overflow." Procedia Computer Science 112: 622-631.

Babcock, Pamela. 2004. "Shedding Light on Knowledge Management." HR Magazine 49 (5): 46-51. https://www.Shrm.org/hr-today/news/hr-magazine/pages/0504covstory.aspx.

Benabou, Charles, and Raphaël Benabou. 1999. "Establishing a Formal Mentoring Program for Organization Success." National Productivity Review: The Journal of Organizational Excellence 19 (4): 7-14.

Bock, Gee W., and Young-Gul Kim. 2002. "Breaking the Myths of Rewards: An Exploratory Study of Attitudes about Knowledge Sharing."
Information Resource Management Journal 15 (2): 14-21.

Bureau of Labor Statistics, United States Department of Labor. 2012. "Employee Tenure." https://www.Bls.gov/news.release/archives/tenure_09182012.pdf.

Calefato, Fabio, Filippo Lanubile, and Nicole Novielli. 2018. "How to Ask for Technical Help? Evidence-Based Guidelines for Writing Questions on Stack Overflow." Information and Software Technology 94: 186-207. http://Dx.Doi.org/10.1016/j.infsof.2017.10.009.

Cancialosi, Chris. 2017. "Six Key Steps to Influencing Effective Knowledge Transfer in your Business." Accessed November 22, 2018. https://www.Forbes.com/sites/chriscancialosi/2014/12/08/6-key-steps-to-influencing-effective-knowledge-transfer-in-your-business/#77b27e945fe6.

Cao, Yin, Yuntao Ye, Hongli Zhao, Yunzhong Jiang, Hao Wang, Yizi Shang, and Junfeng Wang. 2018. "Remote Sensing of Water Quality Based on HJ-1A HSI Imagery with Modified Discrete Binary Particle Swarm Optimization-Partial Least Squares (MDBPSO-PLS) in Inland Waters: A Case in Weishan Lake." Ecological Informatics 44: 21-32.

Chang, Young Bong, and Vijay Gurbaxani. 2012. "Information Technology Outsourcing, Knowledge Transfer, and Firm Productivity: An Empirical Analysis." MIS Quarterly: Management Information Systems 36 (4): 1043-1063.

Changchit, Chuleeporn. 2003. "An Investigation Into the Feasibility of Using an Internet-Based Intelligent System to Facilitate Knowledge Transfer." Journal of Computer Information Systems 43 (4): 91-99.

Chesbrough, Henry W. 2003. "A Better Way to Innovate." Harvard Business Review 81 (7): 12-3.

Chua, Alton Y. K., and Snehasish Banerjee. 2015. "Answers or No Answers: Studying Question Answerability in Stack Overflow." Journal of Information Science 41 (5): 720-731.

Davenport, Thomas H. 1997. "Ten Principles of Knowledge Management and Four Case Studies." Knowledge and Process Management 4 (3): 187-208.

Delen, Dursun, and Suliman Al-Hawamdeh. 2009.
"A Holistic Framework for Knowledge Discovery and Management." Communications of the ACM 52 (6): 141-145.

DeMeyer, Arnoud. 1991. "Tech Talk: How Managers are Stimulating Global R&D Communication." Sloan Management Review 32 (3): 49-58.

DeSimone, L. D., George N. Hatsopoulos, William F. O'Brien, Bill Harris, and Charles P. Holt. 1995. "How Can Big Companies Keep the Entrepreneurial Spirit Alive?" Harvard Business Review 73 (6): 183-189.

Dillon, Peter C., William K. Graham, and Andrea L. Aidells. 1972. "Brainstorming on a Hot Problem: Effects of Training and Practice on Individual and Group Performance." Journal of Applied Psychology 56 (6): 487-490.

Garbade, Michael J. 2018. "A Simple Introduction to Natural Language Processing." Accessed September 6, 2019. https://BecomingHuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32.

Geiger, Adrianne H. 1994. "Measures for Mentors." The Training and Development Sourcebook 46 (2): 65-67.

Gist, Marilyn E., Catherine Schwoerer, and Benson Rosen. 1989. "Effects of Alternative Training Methods on Self-Efficacy and Performance in Computer Software Training." Journal of Applied Psychology 74 (6): 884-891.

Guo, Yuchen, Guiguang Ding, Yuqi Wang, and Xiaoming Jin. 2016. "Active Learning with Cross-Class Knowledge Transfer." AAAI. Beijing. 1624-1630.

He, Ming, Jiuling Zhang, and Jiang Zhang. 2017. "MINDTL: Multiple Incomplete Domains Transfer Learning for Information Recommendation." China Communications 14 (11): 218-236. doi:10.1109/CC.2017.8233662.

He, Ming, Jiuling Zhang, Peng Yang, and Kaisheng Yao. 2018. "Robust Transfer Learning for Cross-Domain Collaborative Filtering Using Multiple Rating Patterns Approximation." In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 225-233. ACM.

Ihrig, Martin, and Ian MacMillan. 2015. "Managing your Mission-Critical Knowledge." Harvard Business Review 93 (1).

Iyengar, Kishen, Jeffrey R.
Sweeney, and Ramiro Montealegre. 2015. "Information Technology Use as a Learning Mechanism: The Impact of IT Use on Knowledge Transfer Effectiveness, Absorptive Capacity, and Franchisee Performance." MIS Quarterly: Management Information Systems 39 (3): 615-641.

Jacobs, Susan. 2018. "Developing a Strategy to Facilitate Knowledge Transfer." Accessed November 22, 2018. https://www.LearningSolutionsMag.com/articles/developing-a-strategy-to-facilitate-knowledge-transfer.

Jiang, Wenhao, Wei Liu, and Fu-Lai Chung. 2018. "Knowledge Transfer for Spectral Clustering." Pattern Recognition 81: 484-496. https://Doi.org/10.1016/j.patcog.2018.04.018.

Joshi, Kshiti D., Saonee Sarker, and Suprateek Sarker. 2007. "Knowledge Transfer Within Information Systems Development Teams: Examining the Role of Knowledge Source Attributes." Decision Support Systems 43 (2): 322-335.

Kalpic, Brane. 2008. "Why Bigger Is Not Always Better: The Strategic Logic of Value Creation Through Mergers and Acquisitions." Journal of Business Strategy 29 (6): 4-13. Accessed January 22, 2018. https://Search-ProQuest-com.ProxyGw.wRlc.org/docview/202682959.

Khandelwal, Vijay K., and Petter Gottschalk. 2003. "Information Technology Support for Interorganizational Knowledge Transfer: An Empirical Study of Law Firms in Norway and Australia." Information Resources Management Journal 16 (1): 14-23.

Kim, Junghwan, Jaeki Song, and Donald R. Jones. 2011. "The Cognitive Selection Framework for Knowledge Acquisition Strategies in Virtual Communities." International Journal of Information Management 31 (2): 111-120.

Ko, Dong-Gil. 2010. "Consultant Competence Trust Does Not Pay Off, But Benevolent Trust Does! Managing Knowledge with Care." Journal of Knowledge Management 14 (2): 202-213.

Koman, Gabriel, and Jana Kundrikova. 2016. "Application of Big Data Technology in Knowledge Transfer Process Between Business and Academia." Procedia Economics and Finance: 605-611.

Kowalik, Grzegorz, and Radoslaw Nielek. 2016.
"Senior Programmers: Characteristics of Elderly Users from Stack Overflow." International Conference on Social Informatics, 87-96. Cham: Springer.

Laudon, Kenneth C., and Jane P. Laudon. 2006. "Managing Knowledge in the Digital Firm." In Management Information Systems: Managing the Digital Firm, 563-618. Pearson.

Masudur-Rahman, Mohammad, and Chanchal K. Roy. 2018. "An Insight into the Unresolved Questions at Stack Overflow." arXiv.

Matei, Sorin Adam, Amani Abu Jabal, and Elisa Bertino. 2018. "Social-Collaborative Determinants of Content Quality in Online Knowledge Production Systems: Comparing Wikipedia and Stack Overflow." Social Network Analysis and Mining 8 (1): 36.

Murase, Kimiya, and Kunihito Kato. 2018. "Near-IR Material Discrimination Method by Using Multidimensional Response Variables PLS Regression Analysis." In 2018 International Workshop on Advanced Image Technology (IWAIT), 1-4. IEEE.

National Science Foundation. 2017. "Chapter 8. Invention, Knowledge Transfer, and Innovation." Accessed November 22, 2018. https://www.Nsf.gov/statistics/2018/nsb20181/report/sections/invention-knowledge-transfer-and-innovation/knowledge-transfer.

Nowlan, Michael F., and M. Brian Blake. 2007. "Agent-Mediated Knowledge Sharing for Intelligent Services Management." Information Systems Frontiers 9 (4): 411-421. doi:10.1007/s10796-007-9043-6.

Oliveira, Nigini, Michael Muller, Nazareno Andrade, and Katharina Reinecke. 2018. "The Exchange in Stack Exchange: Divergences Between Stack Overflow and its Culturally Diverse Participants." Proceedings of the ACM on Human-Computer Interaction: 1-22. https://Doi.org/10.1145/3274399.

Quick MBA. 2010. "Open Innovation - Porter's Generic Strategies." Accessed December 10, 2018. http://www.QuickMba.com/entre/open-innovation/.

Ragkhitwetsagul, Chaiyong, Jens Krinke, and Rocco Oliveto. 2017. "Awareness and Experience of Developers to Outdated and License-Violating Code on Stack Overflow: An Online Survey."
arXiv preprint arXiv:1806.08149v1 [cs.SE]. UCL Computer Science Research Notes.

Raytheon Professional Services LLC. 2012. "Onboarding and Knowledge Transfer." Training Industry, Incorporated. https://TrainingIndustry.com/content/uploads/2017/07/onboarding-and-knowledge-transfer-report.pdf.

Roberts, Joanne. 2000. "From Know-How to Show-How? Questioning the Role of Information Communication Technologies in Knowledge Transfer." Technology Analysis and Strategic Management 12 (4): 429-433. doi:10.1080/713698499.

Rodríguez-Pérez, Raquel, Luis Fernández, and Santiago Marco. 2018. "Overoptimism in Cross-Validation When Using Partial Least Squares-Discriminant Analysis for Omics Data: A Systematic Study." Analytical and Bioanalytical Chemistry 410 (23): 5981-5992.

Rohrbach, Marcus, Michael Stark, and Bernt Schiele. 2011. "Evaluating Knowledge Transfer and Zero-Shot Learning in a Large-Scale Setting." In Computer Vision and Pattern Recognition (CVPR) - IEEE Conference, 1641-1648.

Roy, Kunal, Supratik Kar, and Rudra Narayan Das. 2015. "Selected Statistical Methods in QSAR." Chap. 6 in Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment. Academic Press.

Santhanam, Radhika, Larry Seligman, and David Kang. 2007. "Post-Implementation Knowledge Transfers to Users and Information Technology Professionals." Journal of Management Information Systems 24 (1): 171-199. https://www.Jstor.org/stable/40398886.

Sarker, Saonee, Suprateek Sarker, Darren B. Nicholson, and Kshiti D. Joshi. 2005. "Knowledge Transfer in Virtual Systems Development Teams: An Exploratory Study of Four Key Enablers." IEEE Transactions on Professional Communication 48 (2): 201-218. doi:10.1109/TPC.2005.849650.

Scandura, Terri A. 1992. "Mentoring and Career Mobility: An Empirical Investigation." Journal of Organizational Behavior 13 (2): 169-174.

Schaufeld, Jerry. 2015.
In Commercializing Innovation: Turning Technology Breakthroughs Into Products, 166. New York: Apress.

Shahbandi, Saeed Gholami, Martin Magnusson, and Karl Iagnemma. 2018. "Nonlinear Optimization of Multimodal Two-Dimensional Map Alignment with Application to Prior Knowledge Transfer." IEEE Robotics and Automation Letters 3 (3): 2040-2047.

Simon, Herbert Alexander. 1971. "Designing Organizations for an Information-Rich World." In Computers, Communication, and the Public Interest, edited by Martin Greenberger. Baltimore: The Johns Hopkins Press.

Simon, Steven J., Varun Grover, James T. C. Teng, and Kathleen Whitcomb. 1996. "The Relationship of Information System Training Methods and Cognitive Ability to End-User Satisfaction, Comprehension, and Skill Transfer: A Longitudinal Field Study." Information Systems Research 7 (4): 466-490.

Slaughter, Sandra A., and Laurie J. Kirsch. 2006. "The Effectiveness of Knowledge Transfer Portfolios in Software Process Improvement: A Field Study." Information Systems Research 17 (3): 301-320. https://Doi.org/10.1287/isre.1060.0098.

Starcevich, Matt, and Fred Friend. 1999. "Effective Mentoring Relationships from the Mentee's Perspective." Workforce: 2-3.

Sun, Hao, Zhe Zhang, Babatunde Shittu Olasege, Zhong Xu, Qingbo Zhao, Peipei Ma, Qishan Wang, and Yuchun Pan. 2019. "Application of Partial Least Squares in Exploring the Genome Selection Signatures Between Populations." Heredity 122 (3): 288-293.

Swap, Walter, Dorothy Leonard, Mimi Shields, and Lisa Abrams. 2001. "Using Mentoring and Storytelling to Transfer Knowledge in the Workplace." Journal of Management Information Systems 18 (1): 95-114. doi:10.1080/07421222.2001.11045668.

Szulanski, Gabriel, Dimo Ringov, and Robert J. Jensen. 2016. "Overcoming Stickiness: How the Timing of Knowledge Transfer Methods Affects Transfer Difficulty." Organization Science 27 (2): 304-322. https://Doi.org/10.1287/orsc.2016.1049.

Vasilescu, Bogdan, Vladimir Filkov, and Alexander Serebrenik. 2013.
"Stack Overflow and GitHub: Associations Between Software Development and Crowdsourced Knowledge." Social Computing (SocialCom) - IEEE 188-195. Wang, Bo, Ning Zhang, Quan Lin, Songcan Chen, and Yuhua Li. 2010. "Semantic-Oriented Knowledge Transfer for Review Rating." Tsinghua Science and Technology 15 (6): 633- 641. Yang, Di, Aftab Hussain, and Cristina Videira Lopes. 2016. "From Query to Usable Code: An Analysis of Stack Overflow Code Snippets." ACM 391-402. doi:10.1145/2901739.2901767. Yin, Cao, Ye Yuntao, and Zhao Hongli. 2018. "Satellite Hyperspectral Retrieval of Turbidity for Water Source Based on Discrete Particle Swarm and Partial Least Squares." Transactions of the Chinese Society for Agricultural Machinery 49 (1): 173-182. Zhang, Jingxuan, He Jiang, Zhilei Ren, and Xin Chen. 2018. "Recommending APIs for API Related Questions in Stack Overflow." IEEE Access 6: 6205-6219. 100 Zhang, Michael J. 2007. "An Empirical Assessment of the Performance Impacts of Information Systems Support for Knowledge Transfer." Edited by Murray E. Jennex. International Journal of Knowledge Management (Idea Group Publishing) 3 (1): 66-85. Zhang, Xinmin, Manabu Kano, Masahiro Tani, Junichi Mori, Junji Ise, and Kohhei Harada. 2018. "Hurdle Modeling for Defect Data with Excess Zeros in Steel Manufacturing Process." IFAC - Papers On Line 51 (18): 375-380. Zhang, Yun, David Lo, Xin Xia, and Jian-Ling Sun. 2015. "Multi-Factor Duplicate Question Detection in Stack Overflow." Journal of Computer Science and Technology 30 (5): 981- 997. doi:10.1007/s11390-015-1576-4. 101 Appendix A Table A-1. 
SEDE’s sample data for the First Hypothesis Variables

Record | DV: Answer_Score | IV: Question_Score | IV: Question_Number_Of_Views | IV: Answer_Text_Length | IV: Author_Views | IV: Comment_Sum | IV: Suggested_Edits_Count
1 | 3,264 | 2,421 | 360,255 | 1,593 | 21,365 | 870 | 114
2 | 1,127 | 792 | 218,788 | 11,362 | 3,645 | 435 | 115
3 | 1,123 | 651 | 184,312 | 783 | 132 | 360 | 75
4 | 863 | 479 | 229,164 | 531 | 406 | 360 | 116
5 | 844 | 241 | 65,835 | 446 | 30 | 75 | 18
… | … | … | … | … | … | … | …
1,481 | -4 | 17 | 16,319 | 429 | 256 | 14 | 4
1,482 | -5 | 24 | 2,089 | 797 | 132 | 3 | 3
1,483 | -7 | 11 | 16,199 | 159 | 32 | 24 | 12
1,484 | -7 | 2 | 600 | 2,172 | 71 | 4 | 11
1,485 | -8 | -3 | 127 | 519 | 83 | 21 | 15

Note. DV means dependent variable (Y-axis). IV means independent variable (X-axis containing confirmed predictors). The table displays the first five and the last five records collected. All data is shown before transformation.

Table A-2. SEDE’s sample data for the Second Hypothesis Variables

Record | DV: Answer_Accepted | IV: Answer_Score | IV: Bags_Of_Words_Score | IV: Tag_Sum | IV: Comment_Sum | IV: Suggested_Edits_Count
1 | 1 | 3,264 | 24 | 107,427,957 | 870 | 114
2 | 1 | 1,127 | 17 | 38,454,827 | 435 | 115
3 | 1 | 1,123 | 43 | 5,668,860 | 360 | 75
4 | 1 | 863 | 22 | 93,960 | 360 | 116
5 | 1 | 844 | 7 | 7,718,232 | 75 | 18
… | … | … | … | … | … | …
1,481 | 1 | -4 | 63 | 4,038,158 | 14 | 4
1,482 | 1 | -5 | 32 | 1,275,204 | 3 | 3
1,483 | 0 | -7 | 18 | 5,841,447 | 24 | 12
1,484 | 1 | -7 | 106 | 3,401,541 | 4 | 11
1,485 | 1 | -8 | 22 | 9,834,340 | 21 | 15

Note. DV means dependent variable (Y-axis). IV means independent variable (X-axis containing confirmed predictors). The table displays the first five and the last five records collected. All data is shown before transformation.

Table A-3. List of SEDE terms

Table Name | Column | Data Type (Precision) | Description
Badges [116] | Id | INT (10) | PK.
Badges | UserId | INT (10) | FK to Users.
Badges | Name | NVARCHAR (50) | Name of the badge.
Badges | Date | DATETIME | yyyy-MM-dd hh:mm:ss.fff. [117]
Badges | Class | TINYINT (3) | Gold (1), silver (2), or bronze (3).
Badges | TagBased | BIT | True for tagged badges; false for named badges.
CloseAsOffTopicReasonTypes | Id | SMALLINT (5) | PK.
CloseAsOffTopicReasonTypes | IsUniversal | BIT | -
CloseAsOffTopicReasonTypes | MarkdownMini | NVARCHAR (500) | Close reason’s markdown.
CloseAsOffTopicReasonTypes | CreationDate | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | CreationModeratorId (optional) | INT (10) | FK to Users.
CloseAsOffTopicReasonTypes | ApprovalDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | ApprovalModeratorId (optional) | INT (10) | FK to Users.
CloseAsOffTopicReasonTypes | DeactivationDate (optional) | DATETIME | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | DeactivationModeratorId (optional) | INT (10) | FK to Users.
CloseReasonTypes | Id | TINYINT (3) | PK.
CloseReasonTypes | Name | NVARCHAR (200) | Current: 101 = Duplicate; 102 = Off-topic; 103 = Unclear what you are asking; 104 = Too broad; 105 = Primarily opinion-based. Deprecated: 1 = Exact duplicate; 2 = Off-topic; 3 = Not constructive, subjective, or argumentative; 4 = Not a real question; 7 = Too localized; 10 = General reference; 20 = Noise or pointless.
CloseReasonTypes | Description (optional) | NVARCHAR (500) | Detailed description.
Comments | Id | INT (10) | PK.
Comments | PostId | INT (10) | FK to Posts.
Comments | Score | INT (10) | Relevance index. There is no voting feature for comments; hence, they have no impact on a user’s reputation.
Comments | Text | NVARCHAR (600) | Comment body. Only users having a reputational score of 50 or higher can enter comments.
Comments | CreationDate | DATETIME (3) | When the comment was created.
Comments | UserDisplayName (optional) | NVARCHAR (30) | Author’s display name or anonymous.
Comments | UserId (optional) | INT (10) | FK to Users. The user account might not exist.
FlagTypes | Id | TINYINT (3) | PK.

116 See Appendix B, Table B-1, Queries #2 and #3.
117 All timestamps use UTC (Coordinated Universal Time or Greenwich Mean Time). To make a conversion to EST (Eastern Standard Time), use Query #4. To list all time zones, use Query #5.
FlagTypes | Name | NVARCHAR (50) | Question recommended close (13), question close (14), and question reopen (15).
FlagTypes | Description | NVARCHAR (500) | A user without close privileges suggests a question should be closed. A user with close privileges is voting to reopen a question. A user with close privileges is voting to close a question.
PendingFlags | Id | INT (10) | PK.
PendingFlags | FlagTypeId | TINYINT (3) | FK to FlagTypes.
PendingFlags | PostId | INT (10) | FK to Posts.
PendingFlags | CreationDate (optional) | DATE | Record creation date.
PendingFlags | CloseReasonTypeId (optional) | TINYINT (3) | FK to CloseReasonTypes.
PendingFlags | CloseAsOffTopicReasonTypeId (optional) | SMALLINT (5) | FK to CloseAsOffTopicReasonTypes, but only when the close reason is off-topic.
PendingFlags | DuplicateOfQuestionId (optional) | INT (10) | FK to Posts for old or current duplicates.
PendingFlags | BelongsOnBaseHostAddress (optional) | NVARCHAR (100) | Votes to close and migrate.
PostFeedback | Id | INT (10) | PK. This table stores positive and negative votes from non-registered users.
PostFeedback | PostId | INT (10) | FK to Posts.
PostFeedback | IsAnonymous (optional) | BIT | Anonymous or unregistered users with no reputation.
PostFeedback | VoteTypeId | TINYINT (3) | FK to VoteTypes.
PostFeedback | CreationDate | DATETIME (3) | Record creation date.
PostHistory | Id | INT (10) | PK.
PostHistory | PostHistoryTypeId | TINYINT (3) | FK to PostHistoryTypes.
PostHistory | PostId | INT (10) | FK to Posts.
PostHistory | RevisionGUID | UNIQUEIDENTIFIER | Group multiple historical records that occurred in a single action.
PostHistory | CreationDate | DATETIME (3) | Record creation date.
PostHistory | UserId (optional) | INT (10) | FK to Users.
PostHistory | UserDisplayName (optional) | NVARCHAR (40) | Only visible when the User ID is not available.
PostHistory | Comment (optional) | NVARCHAR (400) | Comments entered by users editing the post. When the value equals 10, the close reason will be visible. When the value equals 33 or 34, the Post Notice ID will be visible.
PostHistory | Text (optional) | NVARCHAR (-1) | Revision’s content: JSON string with all users who voted when Post History Type ID equals 10, 11, 12, 13, 14, 15, 19, 20, or 35; JSON string with all Original Question IDs if it is a duplicate close vote; and, if the ID equals 17, it will contain migration metadata.
PostHistoryTypes | Id | TINYINT (3) | PK: 1 = Initial title (questions only); 2 = Initial body; 3 = Initial tags (questions only); 4 = Edit title (questions only); 5 = Edit body (raw markdown); 6 = Edit tags (questions only); 7 = Rollback title (questions only); 8 = Rollback body (raw markdown); 9 = Rollback tags (questions only); 10 = Post closed; 11 = Post reopened; 12 = Post deleted; 13 = Post undeleted or restored; 14 = Post locked by moderator; 15 = Post unlocked by moderator; 16 = Community-owned; 17 = Post migrated (replaced by codes 35 and 36, which stand for away and here); 18 = Question merged with deleted question; 19 = Question protected by moderator; 20 = Question unprotected by moderator; 21 = Post disassociated by administrator; 22 = Question unmerged (metadata was restored to a previously-merged question); 24 = Suggested edit applied; 25 = Post tweeted; 31 = Comment discussion moved to chat; 33 = Post notice added (FK to PostNotices); 34 = Post notice removed; 35 = Post migrated away (refer to code 17); 36 = Post migrated here (refer to code 17); 37 = Post merge source; 38 = Post merge destination; 50 = Community bump. Deprecated: 23 = Unknown development-related event; 26 = Vote nullification by developer; 27 = Post unmigrated or hidden by a moderator; 28 = Unknown suggestion event; 29 = Unknown moderator event (possibly due to dewikification); 30 = Unknown event.
PostHistoryTypes | Name | NVARCHAR (50) | Name associated with the ID, as outlined above.
PostLinks | Id | INT (10) | PK.
PostLinks | CreationDate | DATETIME (3) | Record creation date.
PostLinks | PostId | INT (10) | FK to Posts (original or source).
PostLinks | RelatedPostId | INT (10) | FK to Posts (related or target).
PostLinks | LinkTypeId | TINYINT (3) | Type of link (in reference to Related Post ID): 1 = Linked; 3 = Duplicate.
PostNotices | Id | INT (10) | PK.
PostNotices | PostId | INT (10) | FK to Posts.
PostNotices | PostNoticeTypeId (optional) | INT (10) | FK to PostNoticeTypes.
PostNotices | CreationDate | DATETIME (3) | Record creation date.
PostNotices | DeletionDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
PostNotices | ExpiryDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
PostNotices | Body (optional) | NVARCHAR (-1) | Custom text that accompanies the notice.
PostNotices | OwnerUserId (optional) | INT (10) | FK to Users.
PostNotices | DeletionUserId (optional) | INT (10) | FK to Users.
PostNoticeTypes | Id | INT (10) | PK.
PostNoticeTypes | ClassId | TINYINT (3) | 1 = Historical lock; 2 = Bounty; 4 = Moderator notice.
PostNoticeTypes | Name (optional) | NVARCHAR (80) | Name of the notice type.
PostNoticeTypes | Body (optional) | NVARCHAR (-1) | Default notice content.
PostNoticeTypes | IsHidden | BIT | Whether the notice type is visible or not.
PostNoticeTypes | Predefined | BIT | Whether the notice type is a constant or not.
PostNoticeTypes | PostNoticeDurationId | INT (10) | -1 = No duration specified; 1 = Seven days for bounties.
Posts | Id | INT (10) | PK. This table stores all active, non-deleted posts.
Posts | PostTypeId | TINYINT (3) | FK to PostTypes.
Posts | AcceptedAnswerId (optional) | INT (10) | Only visible when the Post Type ID equals one (1).
Posts | ParentId (optional) | INT (10) | Only visible when the Post Type ID equals two (2).
Posts | CreationDate | DATETIME (3) | Record creation date.
Posts | DeletionDate (optional) | DATETIME (3) | Available only when there is a PostsWithDeleted record.
Posts | Score | INT (10) | Relevance score.
Posts | ViewCount (optional) | INT (10) | The number of times the post has been viewed.
Posts | Body (optional) | NVARCHAR (-1) | Displayed as rendered (not markdown) HTML.
Posts | OwnerUserId (optional) | INT (10) | FK to Users. Only visible when the user is active. Wiki entries are owned by the community and have a value of -1.
Posts | OwnerDisplayName (optional) | NVARCHAR (40) | User’s friendly name.
Posts | LastEditorUserId (optional) | INT (10) | Last user who edited the post.
Posts | LastEditorDisplayName (optional) | NVARCHAR (40) | User’s friendly name.
Posts | LastEditDate (optional) | DATETIME (3) | Most recent edit date/time.
Posts | LastActivityDate (optional) | DATETIME (3) | Most recent activity date/time.
Posts | Title (optional) | NVARCHAR (250) | Post’s title.
Posts | Tags (optional) | NVARCHAR (250) | Tags or keywords used. Each post can have a maximum of five tags, considering it might be related to several subjects.
Posts | AnswerCount (optional) | INT (10) | Number of answers entered.
Posts | CommentCount (optional) | INT (10) | Number of comments entered.
Posts | FavoriteCount (optional) | INT (10) | The number of times the post has been favorited.
Posts | ClosedDate (optional) | DATETIME (3) | Visible when the post is closed.
Posts | CommunityOwnedDate (optional) | DATETIME (3) | Visible when the post belongs to a community wiki.
PostsWithDeleted | Id | INT (10) | PK. This table’s schema duplicates Posts and stores all deleted posts.
PostsWithDeleted | PostTypeId | TINYINT (3) | FK to PostTypes.
PostsWithDeleted | AcceptedAnswerId (optional) | INT (10) | Not populated.
PostsWithDeleted | ParentId (optional) | INT (10) | Only visible when the Post Type ID equals 2.
PostsWithDeleted | CreationDate | DATETIME (3) | Record creation date.
PostsWithDeleted | DeletionDate (optional) | DATETIME (3) | Available only when there is a PostsWithDeleted record.
PostsWithDeleted | Score | INT (10) | Relevance score.
PostsWithDeleted | ViewCount (optional) | INT (10) | Not populated.
PostsWithDeleted | Body (optional) | NVARCHAR (-1) | Not populated.
PostsWithDeleted | OwnerUserId (optional) | INT (10) | Not populated.
PostsWithDeleted | OwnerDisplayName (optional) | NVARCHAR (40) | Not populated.
PostsWithDeleted | LastEditorUserId (optional) | INT (10) | Not populated.
PostsWithDeleted | LastEditorDisplayName (optional) | NVARCHAR (40) | Not populated.
PostsWithDeleted | LastEditDate (optional) | DATETIME (3) | Not populated.
PostsWithDeleted | LastActivityDate (optional) | DATETIME (3) | Not populated.
PostsWithDeleted | Title (optional) | NVARCHAR (250) | Not populated.
PostsWithDeleted | Tags (optional) | NVARCHAR (250) | Tags or keywords used.
PostsWithDeleted | AnswerCount (optional) | INT (10) | Not populated.
PostsWithDeleted | CommentCount (optional) | INT (10) | Not populated.
PostsWithDeleted | FavoriteCount (optional) | INT (10) | Not populated.
PostsWithDeleted | ClosedDate (optional) | DATETIME (3) | Visible when the post is closed.
PostsWithDeleted | CommunityOwnedDate (optional) | DATETIME (3) | Not populated.
PostTags | PostId | INT (10) | FK to Posts.
PostTags | TagId | INT (10) | FK to Tags.
PostTypes | Id | TINYINT (3) | PK: 1 = Question; 2 = Answer; 3 = Orphaned tag wiki; 4 = Tag wiki excerpt; 5 = Tag wiki; 6 = Moderator nomination; 7 = Wiki placeholder (or election description); 8 = Privileged wiki.
PostTypes | Name | NVARCHAR (50) | Name of the post type.
ReviewRejectionReasons | Id | TINYINT (3) | PK: 1 = Radical change; 2 = Vandalism; 3 = Too minor; 4 = Invalid edit; 5 = Copied content; 6 = Wiki not helpful; 7 = Excerpt not helpful; 8 = Suggested edit conflict; 9 = Critical issues; 101 = Spam or vandalism; 102 = No improvement whatsoever; 103 = Irrelevant tag; 104 = Clearly conflicts with author’s intent; 105 = Attempt to reply; 106 = Copied content; 107 = Lacks usage guidance; 108 = Suggested edit conflict; 109 = Critical issues; 110 = Circular tag definition.
ReviewRejectionReasons | Name | NVARCHAR (100) | Review rejection reason’s name.
ReviewRejectionReasons | Description | NVARCHAR (300) | A detailed description of the review rejection reason.
ReviewRejectionReasons | PostTypeId (optional) | TINYINT (3) | FK to PostTypes, only when the Post Type ID equals Wiki not helpful (6) or Excerpt not helpful (7).
ReviewTaskResults | Id | INT (10) | PK.
ReviewTaskResults | ReviewTaskId | INT (10) | FK to ReviewTasks.
ReviewTaskResults | ReviewTaskResultTypeId | TINYINT (3) | FK to ReviewTaskResultTypes.
ReviewTaskResults | CreationDate (optional) | DATE (0) | Record creation date.
ReviewTaskResults | RejectionReasonId (optional) | TINYINT (3) | FK to ReviewRejectionReasons for suggested edits.
ReviewTaskResults | Comment (optional) | NVARCHAR (150) | Any clarifying comments.
ReviewTaskResultTypes | Id | TINYINT (3) | PK: 1 = Not sure; 2 = Approve for suggested edits; 3 = Reject for suggested edits; 4 = Delete for low-quality posts; 5 = Edit for first-time (low-quality posts and late answers); 6 = Close for low-quality posts; 7 = Looks good for high-quality posts; 8 = Do not close; 9 = Recommend deletion for low-quality answers; 10 = Recommend close for low-quality questions; 11 = I am done (related to first posts); 12 = Reopen; 13 = Leave closed; 14 = Edit and reopen; 15 = Excellent by community evaluation; 16 = Satisfactory by community evaluation; 17 = Needs improvement by community evaluation; 18 = No action needed (related to first posts and late answers); 19 = Reject and edit; 20 = Should be improved; 21 = Unsalvageable.
ReviewTaskResultTypes | Name | NVARCHAR (100) | Name of the review task result type.
ReviewTaskResultTypes | Description | NVARCHAR (300) | A detailed description of the review task result type.
ReviewTasks | Id | INT (10) | PK.
ReviewTasks | ReviewTaskTypeId | TINYINT (3) | FK to ReviewTaskTypes.
ReviewTasks | CreationDate (optional) | DATE (0) | Record creation date.
ReviewTasks | DeletionDate (optional) | DATE (0) | Record deletion date.
ReviewTasks | ReviewTaskStateId | TINYINT (3) | FK to ReviewTaskStates.
ReviewTasks | PostId | INT (10) | FK to Posts.
ReviewTasks | SuggestedEditId (optional) | INT (10) | FK to SuggestedEdits. It contains internal numbering for historicity.
ReviewTasks | CompletedByReviewTaskId (optional) | INT (10) | FK to ReviewTaskResults for the outcome of a completed review.
ReviewTaskStates | Id | TINYINT (3) | PK: 1 = Active, when the review task is still in the queue; 2 = Completed, when the review task was completed so it is no longer in the queue; 3 = Invalidated, when the task was dequeued naturally. A post might be edited to achieve higher quality, or close votes might expire, all of which are completely separated from the review dashboard.
ReviewTaskStates | Name | NVARCHAR (50) | Name of the review task state.
ReviewTaskStates | Description | NVARCHAR (300) | A detailed description of the review task state.
ReviewTaskTypes | Id | TINYINT (3) | PK: 1 = Suggested edit; 2 = Close votes; 3 = Low-quality posts; 4 = First post; 5 = Late answer; 6 = Reopen vote; 7 = Community evaluation; 10 = Triage; 11 = Helper. Deprecated: 8 = Link validation; 9 = Flagged posts.
ReviewTaskTypes | Name | NVARCHAR (50) | Name of the review task type.
ReviewTaskTypes | Description | NVARCHAR (300) | A detailed description of the review task type.
SuggestedEdits | Id | INT (10) | PK. This record is visible (under review) when both approval and rejection dates are blank, and the ReviewTasks record holds an active state.
SuggestedEdits | PostId | INT (10) | FK to Posts.
SuggestedEdits | CreationDate (optional) | DATETIME (3) | Record creation date.
SuggestedEdits | ApprovalDate (optional) | DATETIME (3) | Record approval date is only visible for approved edits.
SuggestedEdits | RejectionDate (optional) | DATETIME (3) | Record rejection date is only visible for rejected edits.
SuggestedEdits | OwnerUserId (optional) | INT (10) | FK to Users.
SuggestedEdits | Comment (optional) | NVARCHAR (800) | Additional user comment.
SuggestedEdits | Text (optional) | NVARCHAR (-1) | Edit content.
SuggestedEdits | Title (optional) | NVARCHAR (250) | Edit title.
SuggestedEdits | Tags (optional) | NVARCHAR (250) | Edit tags.
SuggestedEdits | RevisionGUID (optional) | UNIQUEIDENTIFIER | Group multiple edits.
SuggestedEditVotes | Id | INT (10) | PK.
SuggestedEditVotes | SuggestedEditId | INT (10) | FK to SuggestedEdits.
SuggestedEditVotes | UserId | INT (10) | FK to Users (the person who is making the suggestion).
SuggestedEditVotes | VoteTypeId | TINYINT (3) | FK to VoteTypes.
SuggestedEditVotes | CreationDate | DATETIME (3) | Record creation date.
SuggestedEditVotes | TargetUserId (optional) | INT (10) | FK to Users (the person who is the target of the suggestion).
SuggestedEditVotes | TargetRepChange | INT (10) | Targeted area to be edited.
Tags | Id | INT (10) | PK.
Tags | TagName (optional) | NVARCHAR (35) | Name of the tag.
Tags | Count | INT (10) | The number of times it has been listed.
Tags | ExcerptPostId (optional) | INT (10) | FK to Posts.
Tags | WikiPostId (optional) | INT (10) | FK to Posts.
TagSynonyms | Id | INT (10) | PK.
TagSynonyms | SourceTagName (optional) | NVARCHAR (35) | FK to Tags (origin).
TagSynonyms | TargetTagName (optional) | NVARCHAR (35) | FK to Tags (destination).
TagSynonyms | CreationDate | DATETIME (3) | Record creation date.
TagSynonyms | OwnerUserId | INT (10) | FK to Users.
TagSynonyms | AutoRenameCount | INT (10) | The number of times the tag has been autocorrected.
TagSynonyms | LastAutoRename (optional) | DATETIME (3) | Last time the tag was autocorrected.
TagSynonyms | Score | INT (10) | Relevance score.
TagSynonyms | ApprovedByUserId (optional) | INT (10) | FK to Users.
TagSynonyms | ApprovalDate (optional) | DATETIME (3) | Record approval date.
Users | Id | INT (10) | PK.
Users | Reputation | INT (10) | User reliability or estimated measure of community trust toward a user. A maximum of 200 points can be gained per diem: +5 for a question upvote; +10 for an answer upvote; +15 for an accepted/chosen answer. Higher reputational scores will unlock privileges and enable access to additional features. Stack Overflow values the community’s feedback over the asker’s feedback.
Users | CreationDate | DATETIME (3) | Record creation date.
Users | DisplayName (optional) | NVARCHAR (40) | User friendly name.
Users | LastAccessDate | DATETIME (3) | Last date/time when a user loaded or viewed a page. At best, this is updated every thirty minutes.
Users | WebsiteUrl (optional) | NVARCHAR (200) | User’s personal website.
Users | Location (optional) | NVARCHAR (100) | City or region where the user claims to reside.
Users | AboutMe (optional) | NVARCHAR (-1) | Biographical information or role description.
Users | Views | INT (10) | The number of times the profile has been viewed (popularity).
Users | UpVotes | INT (10) | The number of positive votes received (favorable popularity).
Users | DownVotes | INT (10) | The number of negative votes received (detrimental popularity).
Users | ProfileImageUrl (optional) | NVARCHAR (200) | Link to the user’s picture, avatar, or logo.
Users | EmailHash (optional) | VARCHAR (32) | Deprecated field.
Users | AccountId (optional) | INT (10) | Stack Exchange Network’s Profile ID.
Votes | Id | INT (10) | PK.
Votes | PostId | INT (10) | FK to Posts.
Votes | VoteTypeId | TINYINT (3) | FK to VoteTypes.
Votes | UserId (optional) | INT (10) | FK to Users. Only visible when the Vote Type ID equals 5 (favorited) or 8 (bountied). The Vote Type ID will equal -1 for deleted users.
Votes | CreationDate (optional) | DATETIME (3) | The record creation date is hidden to protect the user’s privacy.
Votes | BountyAmount (optional) | INT (10) | Only visible when the Vote Type ID equals 8 or 9; i.e., the bounty was started or closed.
VoteTypes | Id | TINYINT (3) | PK: 1 = Accepted by originator; 2 = UpMod or approve/positive/upvote; 3 = DownMod or reject/negative/downvote; 4 = Offensive; 5 = Favorite (includes the User ID); 6 = Close (creates a PostHistory record as of June 2013); 7 = Reopen; 8 = Bounty start (populates User ID and bounty amount); 9 = Bounty close (populates a bounty amount); 10 = Deletion; 11 = Undeletion; 12 = Spam; 15 = Moderator review (the moderator has viewed the post after the post was flagged for moderator’s attention); 16 = Approve edit suggestion (a user voted to approve a suggested edit; to be approved, most posts require several user votes or a single, binding moderator vote).
VoteTypes | Name | NVARCHAR (50) | Name of the vote type.

Appendix B

Table B-1. Query list [118]

Query #1.
SELECT CAST(AVG([Score]) AS DECIMAL(10, 2)) FROM [Posts];

Query #2.
SELECT c.[Table_Name],
       CASE WHEN ##PK:STRING?id## = c.[Column_Name]
            THEN CONCAT(c.[Column_Name], ' (PK)')
            ELSE c.[Column_Name]
       END AS Column_Name,
       [Data_Type],
       [Is_Nullable],
       COALESCE(CHARACTER_MAXIMUM_LENGTH, Numeric_Precision, DateTime_Precision) AS [Length / Precision]
FROM [Information_Schema].[Columns] c
WHERE c.[Table_Name] = ##TABLE:STRING?Posts##
ORDER BY [Ordinal_Position] ASC;

118 All queries listed here were created by the Stack Exchange community and are publicly available. Source: https://Data.StackExchange.com/stackoverflow/queries.

Query #3.
SELECT 'query://874190/?table=' + [Table_Name] + '|' + [Table_Name] AS [Table(Links to Sample Data)], -- Sample data subquery.
       [Ordinal_Position] AS [Field Number],
       [Column_Name] AS [Column],
       CASE WHEN [Data_Type] LIKE '%VARCHAR%'
            THEN [Data_Type] + '(' + CAST(Character_Maximum_Length AS VARCHAR) + ')'
            WHEN [Data_Type] LIKE '%INT%'
            THEN [Data_Type] + '(' + CAST(Numeric_Precision AS VARCHAR) + ')'
            ELSE [Data_Type]
       END AS [Type],
       [Table_Schema], -- dbo.
       [Column_Default], -- Nullable.
       [DateTime_Precision], -- It has a value of three for the DateTime data type.
       [Numeric_Precision_Radix], -- Only visible for TinyInt, SmallInt, and Int data types.
       [Numeric_Scale] -- Nullable.
FROM [Information_Schema].[Columns]
--WHERE
--    [Table_Name] LIKE '%Types%' -- Only applicable to static referential tables.
ORDER BY [Table_Name] ASC, [Ordinal_Position] ASC;

Query #4.
SELECT TOP (1)
       [CreationDate] AS [CreationDate],
       DATEADD(HOUR, -4, [CreationDate]) AS [CreationDateEST],
       GETDATE() AS [CurrentDate],
       DATEADD(HOUR, -4, GETDATE()) AS [CurrentDateEST]
FROM [Posts];

Query #5.
SELECT * FROM [Sys].[Time_Zone_Info];

")) { count = count + 10; } File.AppendAllText(_textScoresFile, count.ToString() + "\n"); }
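The reputation arithmetic described for the Users table in Table A-3 (+5 per question upvote, +10 per answer upvote, +15 per accepted answer, with a 200-point daily cap) can be illustrated with a short sketch. This Python fragment is not part of the praxis tooling; the function name is hypothetical, the point values come directly from the table's description, and, as a simplification, the cap is applied uniformly to all three sources exactly as the table states it.

```python
# Illustrative sketch of the Users.Reputation arithmetic described in
# Table A-3. Point values are taken from the table's description; the
# function name is hypothetical and the uniform cap is a simplification.

QUESTION_UPVOTE = 5    # +5 per question upvote
ANSWER_UPVOTE = 10     # +10 per answer upvote
ACCEPTED_ANSWER = 15   # +15 per accepted/chosen answer
DAILY_CAP = 200        # maximum points gained per diem

def daily_reputation(question_upvotes: int,
                     answer_upvotes: int,
                     accepted_answers: int) -> int:
    """Return one day's reputation gain, applying the 200-point cap."""
    raw = (question_upvotes * QUESTION_UPVOTE
           + answer_upvotes * ANSWER_UPVOTE
           + accepted_answers * ACCEPTED_ANSWER)
    return min(raw, DAILY_CAP)

print(daily_reputation(2, 5, 1))    # 2*5 + 5*10 + 1*15 = 75
print(daily_reputation(10, 30, 4))  # raw total 410, capped at 200
```

The cap explains why a single highly upvoted answer cannot translate into an unbounded one-day reputation jump, which matters when reputation-derived fields such as Author_Views are used as predictors.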