Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow

by Raul Quintana Selleras

B.S. in Information Technology, August 2012, Florida International University

B.A. in Religious Studies, December 2012, Florida International University

M.S. in Information Systems, May 2015, The University of Texas at Arlington

A Praxis submitted to

The Faculty of The School of Engineering and Applied Science of The George Washington University in partial fulfillment of the requirements for the degree of Doctor of Engineering

January 10, 2020

Praxis directed by

Timothy Blackburn, Professorial Lecturer of Engineering Management and Systems Engineering

Amir Etemadi, Associate Professor of Engineering and Applied Science

The School of Engineering and Applied Science of The George Washington University certifies that Raul Quintana Selleras has passed the Final Examination for the degree of Doctor of Engineering as of October 15, 2019. This is the final and approved form of the Praxis.

Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow

Raul Quintana Selleras

Praxis Research Committee:

Timothy Blackburn, Professorial Lecturer of Engineering Management and Systems Engineering, Praxis Co-Director

Amir Etemadi, Associate Professor of Engineering and Applied Science, Praxis Co-Director

Ebrahim Malalla, Visiting Associate Professor of Engineering and Applied Science, Committee Member


© Copyright 2019 by Raul Quintana Selleras

All rights reserved


Dedication

The author wishes to dedicate this dissertation to his daughter, Alexia Quintana, and to his wife, Kristina Quintana, for their unconditional support. Also, the author would like to thank his parents, Raul Quintana Sarduy and Gilda Selleras Rivas, whose encouragement was vital to his educational accomplishments.


Acknowledgments

The author wishes to acknowledge his praxis director, Dr. Timothy Blackburn; his editor, Peter Rosenbaum; and all faculty and staff from the Doctor of Engineering program, as well as the students from the seventh cohort.

The author thanks Andrew Rothman and Lucas Longan for their insightful suggestions.


Abstract of Praxis

Predictive Model: Using Text Mining for Determining Factors Leading to High-Scoring Answers in Stack Overflow

With the advent of knowledge-based economies, knowledge transfer within online forums has become increasingly important to the work of IT teams. Stack Overflow, for example, is an online community in which computer programmers can interact and consult with one another to achieve information flow efficiencies and bolster their reputations, which are numerical representations of their standings within the platform. The high volume of information available on Stack Overflow, combined with significant variance in members' expertise and, hence, in the quality of their posts, hinders knowledge transfer and causes developers to waste valuable time locating good answers. Additionally, invalid answers can introduce security vulnerabilities and/or legal risks.

By conducting text analytics and regression, this research presents a predictive model to optimize knowledge transfer among software developers. The model identifies factors (e.g., good tagging, answer character count, tag frequency) that reliably lead to high-scoring answers in Stack Overflow. Upon applying natural language processing, the following variables were found to be significant: (a) the number of answers per question, (b) the cumulative tag score, (c) the cumulative comment score, and (d) the bag-of-words frequency. Additional methods were used to identify the factors that contribute to an answer being selected by the user who posted the question, the community at large, or both.


Predicting what constitutes a good, accurate answer helps not only developers but also Stack Overflow, as the site can redesign its user interface to make better use of its knowledge repository and transfer knowledge more effectively. Likewise, companies that use the platform can decrease the amount of time and resources invested in training, fix software bugs faster, and complete challenging projects in a timely fashion.


Table of Contents

Dedication ...... iv

Acknowledgments ...... v

Abstract of Praxis ...... vi

List of Figures ...... x

List of Tables ...... xii

List of Symbols / Nomenclature ...... xiii

Glossary of Terms ...... xiv

Chapter 1: Introduction ...... 1

1.1 Background ...... 1

1.2 Research Motivation ...... 5

1.3 Problem Statement ...... 6

1.4 Thesis Statement ...... 8

1.5 Research Objectives ...... 10

1.6 Research Questions and Hypotheses ...... 12

1.7 Scope of Research ...... 14

1.8 Research Limitations ...... 14

1.9 Organization of Praxis ...... 15

Chapter 2: Literature Review ...... 17

2.1 Introduction ...... 17

2.2 Information, Knowledge, and Related Concepts ...... 18

2.3 Digging into Stack Overflow ...... 21

2.4 Knowledge Transfer ...... 24


2.5 Online Forums ...... 31

2.6 Summary and Conclusions ...... 33

Chapter 3: Methodology ...... 39

3.1 Introduction ...... 39

3.2 Data Collection and Analysis ...... 43

3.3 Research Methods ...... 47

Chapter 4: Results ...... 57

4.1 Introduction ...... 57

4.2 Data Collection and Preprocessing ...... 59

4.3 Predictive Models ...... 67

4.4 Case Studies ...... 80

Chapter 5: Discussion and Conclusions ...... 86

5.1 Discussion ...... 86

5.2 Conclusions ...... 86

5.3 Contributions to Body of Knowledge ...... 88

5.4 Recommendations for Future Research ...... 88

References ...... 92

Appendix A ...... 102

Appendix B ...... 134


List of Figures

Figure 1-1. Stack Overflow question...... 4

Figure 1-2. Stack Overflow answer...... 4

Figure 1-3. Stack Overflow and optimal answer region...... 11

Figure 2-1. Interest graph...... 20

Figure 3-1. Delen’s holistic framework for knowledge management discovery...... 40

Figure 3-2. Supportability analysis...... 43

Figure 3-3. Relationship between critical data fields collected...... 45

Figure 3-4. Research product...... 49

Figure 3-5. Data preprocessing...... 51

Figure 3-6. Text analytics...... 52

Figure 3-7. ProWritingAid toolbar in Microsoft Office 365 ProPlus...... 53

Figure 3-8. ProWritingAid summary...... 53

Figure 3-9. Post metadata...... 54

Figure 3-10. Author metadata...... 55

Figure 4-1. Stack Overflow users’ age, educational level, and wake-up time...... 58

Figure 4-2. Most popular technologies in Stack Overflow...... 59

Figure 4-3. Query used for data collection in SEDE...... 60

Figure 4-4. SEDE’s partial results...... 61

Figure 4-5. Query’s partial execution plan...... 61

Figure 4-6. C-sharp functionality for calculating BOW scores...... 66

Figure 4-7. Linking data source and quantitative methods...... 67

Figure 4-8. Post frequency distribution per week...... 69


Figure 4-9. PLS standard coefficient plot for answer score...... 73

Figure 4-10. PLS loading plot for answer score...... 73

Figure 4-11. PLS response plot for answer score...... 74

Figure 4-12. Relationship model between research hypotheses and case studies...... 80


List of Tables

Table 3-1. Mapping Delen’s Holistic Framework into Stack Overflow ...... 41

Table 4-1. Sample data for a single post ...... 62

Table 4-2. Analysis of Variance ...... 74

Table 4-3. Model Selection and Validation ...... 75

Table 4-4. Coefficients ...... 75

Table 4-5. Deviance table ...... 78

Table 4-6. Model Summary ...... 78

Table 4-7. Coefficients ...... 78

Table 4-8. Goodness-of-fit Tests ...... 79

Table 4-9. Case studies in API and JavaScript ...... 84

Table A-1. SEDE’s sample data for the First Hypothesis ...... 102

Table A-2. SEDE’s sample data for the Second Hypothesis ...... 103

Table A-3. List of SEDE terms ...... 104

Table B-1. Query list ...... 134


List of Symbols / Nomenclature

PK Primary Key

FK Foreign Key


Glossary of Terms

ANOVA Analysis of Variance

API Application Program Interface

BLR Binary Logistic Regression

BME Belief Measure of Expertise

BOW Bags of Words

BPSO Binary Particle Swarm Optimization

CICO Chief Intellectual Capital Officer

CINO Chief Innovation Officer

CKO Chief Knowledge Officer

CLO Chief Learning Officer

CPO Chief People Officer

CQA Community Question Answering

CSCW Computer-Supported Cooperative Work

CST Central Standard Time

CSV Comma-Separated Values

DoS Denial of Service

EST Eastern Standard Time


FST Fixation Index

GWASs Genome-Wide Association Studies

HR Human Resources

ICTs Information and Communication Technologies

IS Information Systems

IT Information Technology

KB Knowledge Base

KM, VP of Vice President of Knowledge Management

LDA Latent Dirichlet Allocation

M&As Mergers and Acquisitions

MIMIC Multiple Indicators and Multiple Causes

NER Named Entity Recognition

NIR Near-Infrared Ray

NLP Natural Language Processing

NLTK Natural Language Toolkit

PLS Partial Least Squares

SEDE Stack Exchange Data Explorer

SEO Search Engine Optimization


SO Stack Overflow

SOA Service-Oriented Architecture

SPI Software Process Improvement

SQL Structured Query Language

TSC Transfer Spectral Clustering

URL Uniform Resource Locator

UTF Unicode Transformation Format

UX User Experience

VSD Value-Sensitive-Design


Chapter 1: Introduction

1.1 Background

Over the past two decades, economies have shifted from being production-based to being knowledge-based, and knowledge management software and services have become multi-billion-dollar industries (Kim, Song and Jones 2011; Roberts 2000; Sarker et al. 2005; Schaufeld 2015). A report from the National Science Board highlights the importance of knowledge transfer in the creation of successful patents and active licensing deals and the major influence knowledge-based transfer has had on invention and innovation (National Science Foundation 2017). Conversely, as Babcock (2004) observed, failed knowledge transfer1 can translate into loss of property and life.

A professional knowledge worker is an individual who researches, produces, analyzes, and interprets information and ideas. Knowledge transfer is the dynamic, unnatural2 practice of transferring knowledge between people and departments, usually trying to organize, create, capture, and distribute intellectual content within an organization (National Science Foundation 2017). Knowledge transfer differs from traditional learning or education in that it is not usually conducted in a hierarchical fashion.

Knowledge transfer is particularly inefficient for software development teams (Raytheon Professional Services LLC 2012). Many developers rely on Internet resources and online forums rather than on other members of their teams to obtain answers to their technical questions or concerns. Stack Overflow attempts to streamline knowledge transfer between technical users by offering a collaboration platform for addressing programming problems.

1 Both in terms of absorption and transmission.

2 Knowledge acquisition and sharing is time-consuming and usually discouraged by management. Besides, knowledge hoarding is correlated with job security (Davenport 1997).

Stack Overflow is the world's largest and most popular Q&A community (online forum) for technical users. It serves over 50 million computer programmers and software developers worldwide, 60% of whom work on back-end projects (Zhang 2018). On average, each developer accesses the site six times a month, spending at least fifteen seconds per visit. Additionally, Stack Overflow is the largest and most popular community within Stack Exchange, having the highest number of participants and posts. As of June 2019, Stack Overflow had 11 million users and stored 18 million questions and 27 million answers. Questions currently range over a wide variety of technical topics, such as programming languages, algorithms, APIs, disruptive technologies, troubleshooting, configuration, and technical definitions. The success of the Stack Overflow platform depends on users' willingness to collaborate by asking and answering each other's questions (Calefato 2018). In terms of user participation, Stack Overflow is an online knowledge production site controlled by small groups of contributors3 (Matei 2018). Oliveira (2018) classifies users as experts4 or activists.5 Over 94% of participants contribute infrequently, and highly engaged contributors are extremely uncommon (Oliveira 2018).

3 Also known as elite stickiness.

4 Having low participation with high-quality posts.

5 Having high participation with low-quality posts.

Stack Overflow is an open community in which users have a say in how the site behaves, have access to the data from Stack Exchange, and experience a sense of ownership over the platform. This accomplishes both network and feedback effects across the entire community. Chua (2015) defines Stack Overflow as a knowledge-sharing, community-owned online platform that fits the profile of a Community Question Answering (CQA) site harnessing collective wisdom. Additionally, Stack Overflow fits under the definition of Computer-Supported Cooperative Work (CSCW).

Stack Overflow adheres to a sequential process. First, registered users post a question (see Figure 1-1). At this juncture, questioners can enter a title, a description, tags,6 and code snippets regarding the problem they are trying to solve. Other registered users then start offering solutions to the question. The community can upvote/praise, downvote/criticize, or comment on each answer, deciding which answer gets the highest score. Downvotes and upvotes directly establish the reputation of a given user; i.e., the more upvotes a user gets, the better the user's reputation within the community. Furthermore, only the questioner can choose a given answer as the preferred one; the answer then becomes accepted and the question thereby resolved (see Figure 1-2).

6 Buzzwords identifying the subject matter of the inquiry.


Figure 1-1. Stack Overflow question.

Figure 1-2. Stack Overflow answer.
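To make this sequence concrete, the following minimal Python sketch models the posting, voting, and acceptance steps described above. It is illustrative only: the class and method names are hypothetical and do not correspond to Stack Overflow's actual implementation or API.

```python
# Illustrative model of the question/answer/vote/accept flow described above.
# All names are hypothetical; this is not Stack Overflow's actual code or API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Answer:
    author: str
    body: str
    score: int = 0  # upvotes minus downvotes

@dataclass
class Question:
    asker: str
    title: str
    tags: List[str]
    answers: List[Answer] = field(default_factory=list)
    accepted: Optional[Answer] = None  # set only by the asker

    def upvote(self, answer: Answer) -> None:
        answer.score += 1  # also raises the answerer's reputation

    def downvote(self, answer: Answer) -> None:
        answer.score -= 1

    def accept(self, answer: Answer) -> None:
        self.accepted = answer  # at most one accepted answer per question

    def community_best(self) -> Optional[Answer]:
        # The community's choice is simply the highest-scoring answer.
        return max(self.answers, key=lambda a: a.score, default=None)
```

Note that community_best() and accepted need not coincide; that distinction underlies the notion of an optimal answer introduced in Section 1.5.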


Successful questions (i.e., those that are answered) have certain attributes: They are posed by reputable users, are short and clear, contain code snippets, do not abuse uppercase characters, and adopt a neutral7 emotional style (Calefato 2018). To determine the answerability of a question, researchers use a framework with the following features: affect,8 presentation quality,9 post time, and reputation (Calefato 2018). Approximately 29% of questions on Stack Overflow do not receive answers that are accepted,10 even though many of them have one or more proposed solutions.
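As a concrete illustration of the presentation-quality features mentioned in footnotes 8 and 9, the following Python sketch computes URL presence and uppercase usage for a post body; the 0.3 cutoff is an arbitrary placeholder of mine, not a value from the literature.

```python
# Illustrative computation of "presentation quality" features (footnotes 8-9):
# URL presence and uppercase usage. The cutoff below is an assumption.
import re

def presentation_features(body: str) -> dict:
    letters = [c for c in body if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    return {
        "has_url": bool(re.search(r"https?://\S+", body)),
        "uppercase_ratio": upper_ratio,
        "abuses_uppercase": upper_ratio > 0.3,  # arbitrary illustrative cutoff
    }

print(presentation_features("HOW do I fix this?? See https://example.com"))
```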

1.2 Research Motivation

Stack Overflow is one of the most popular online forums among software developers and programming enthusiasts. That fact, along with the naturally complex nature of computer systems and the pervasiveness of programming bugs and hardware failures, is the primary motivation for this praxis.

Finding an adequate solution to a programming problem is not a trivial issue. Stack Overflow has limitations that can be overcome through better matching between questioners and accurate answers. Determining what factors make a reliable, valid, and high-scoring answer can have two major benefits. Firstly, Stack Overflow's user interface can be redesigned to encourage users to tailor their solutions and meet specific standards. Secondly, askers would have their concerns addressed more quickly and efficiently.

7 Stack Overflow is not a discussion forum.

8 Using a method known as sentiment analysis or opinion mining. SentiStrength and Voyant are two popular sentiment analysis tools.

9 Considering the presence of URLs and the number of uppercase characters.

10 Collected from https://StackExchange.com/sites, after clicking on the Stack Overflow panel.

Stack Overflow can be a driving force for transferring knowledge within development teams while increasing job retention and bolstering code quality. This praxis seeks to optimize answer searchability and improve knowledge transfer among software developers in Stack Overflow.

1.3 Problem Statement

Twenty-nine percent of questions posted to Stack Overflow are never answered, and the accepted answers have a low average score,11 barely reaching a value of two. Low scores tend to overshadow the quality of a solution and, therefore, prevent egalitarian (i.e., evenly distributed) information diffusion within the site and hinder knowledge transfer among software developers.

In general, the high volume of information in a knowledge base (Wang 2010) deters knowledge transfer, causing IT personnel to waste an average of 265 man-hours (10-15%) annually (National Science Foundation 2017). Even if not all this time is spent browsing online forums or searching for answers in Stack Overflow, IT employees struggle to find reliable solutions to their pressing problems (Jacobs 2018).

11 See Appendix B, Table B-1, Query #1. Source: https://Data.StackExchange.com/stackoverflow/query/edit/988631. The average score is calculated as follows: $\bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_i$, where $s_i$ is the score of the $i$-th accepted answer and $N$ is the total number of accepted answers. Even though there is considerable score variability from post to post, high-scoring answers tend to have scores in the low hundreds.
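For illustration, assuming accepted-answer scores have been exported from SEDE to a CSV file with a Score column (the file and column names are assumptions about the export, not requirements of SEDE), the average can be computed as follows:

```python
# Minimal sketch: average score of accepted answers from a SEDE CSV export.
# The file name and the "Score" column are assumptions about the export format.
import csv

def average_accepted_score(path: str) -> float:
    with open(path, newline="", encoding="utf-8") as f:
        scores = [int(row["Score"]) for row in csv.DictReader(f)]
    return sum(scores) / len(scores) if scores else 0.0

print(average_accepted_score("accepted_answers.csv"))  # reportedly ~2 (Query #1)
```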


Similarly, according to the Panopto Workplace Knowledge and Productivity Report, (a) large companies can lose up to $47 million in productivity each year, (b) knowledge professionals can waste 5.3 hours weekly, and (c) 81% of workers feel frustrated when they cannot get the information needed to perform their jobs. Also, 85% of employees claim that knowledge transfer positively impacts productivity12 (Jacobs 2018). Fortune 500 companies report losing over $31.5 billion a year, despite the huge investments they make in knowledge transfer programs (Babcock 2004). Indeed, American companies invest over $100 billion in training, but less than 10% of this is invested in work-setting-oriented knowledge transfer support (Benabou and Benabo 1999). In general, common barriers blocking knowledge transfer are the loss of job security, not wanting to be the bearer of bad news, distrust of coworkers and managers, fear of being ridiculed, and interference with one's duties. The last two are the main inhibitors of online collaboration among developers.

IT departments rank third—behind customer service and sales—in terms of turnover attributable to limited knowledge transfer, which occurs predominantly during onboarding. Onboarding13 costs are extremely high. Raytheon has reported that the mean cost of a single job turnover approaches $115,000, and it takes over 100 days on average to hire a software developer.

Additionally, only 39% of executives leading training programs believe their organizations are either effective or somewhat effective at transferring knowledge. Knowledge transfer programs often lack consistency, organizational alignment, time, budget, tools, and executive support (Raytheon Professional Services LLC 2012).

12 Productivity is defined as completing a project within a fixed time frame (Sarker et al. 2005).

13 In this context, onboarding refers to the training processes new employees have to go through.

Babcock (2004, 2) asserted, "Some of the consulting firms' systems got too much knowledge put in them, and people got overwhelmed by the amount of [information] they had to deal with." The massive amount of technical documentation hinders its usage, causing IT teams to stop reading training documents altogether. A famous Latin proverb reads, "Aurum Vergilius de stercore colligit Enni" (Vergil must gather gold from Ennius's manure). When trying to find useful information, most IT professionals feel the same way.

1.4 Thesis Statement

The development of a model that identifies factors leading to a high-scoring answer using natural language processing (NLP) and regression analysis will streamline current searching capabilities in Stack Overflow and facilitate knowledge transfer by increasing developers' chances of finding the answers they need.

Proper communication and documentation keep engineering teams more focused and productive. According to DeMeyer (1991, 49), "The productivity of an R&D engineer depends to a large extent on his or her ability to tap into an appropriate network of information flows," decreasing the number of software bugs and facilitating deployments. For instance, knowledge professionals assisting large teams spend more than 13% of their time dealing with inefficient knowledge management processes (Raytheon Professional Services LLC 2012). Indeed, the more data are available, the more complicated and expensive the logistical challenges of handling and managing those data become (Koman and Kundrikova 2016; Nowlan and Blake 2007). Moreover, a substantial portion of all available documentation is inaccessible or simply lost in the clutter created by excessive data availability.

Compounding matters, the increase of mergers and acquisitions (M&As) within the IT industry and the resulting growth in the size of software-based companies point toward a rapid industry concentration of knowledge and market heterogeneity, a trend that complicates the integration of knowledge repositories (Kalpic 2008). Indeed, exchanging written documents providing "visual anonymity and asynchronous interaction" entails arduous relationships and dealing with relational differences (Szulanski, Ringov and Jensen 2016, 309).

Even though Stack Overflow has a growing repository currently storing millions of posts, it uses formalized standards to avoid data duplication and to remove irrelevant content. By using Stack Overflow to document solutions, engineering teams could cross-train their members and create a cloud-based back-up to store company knowledge (Cancialosi 2017). Stack Overflow Teams was created with a specific purpose—to offer an online repository with private access. Private, cloud-based repositories ensure that knowledge professionals can delegate training to individual departments for greater efficiency. Instead of relying on textual documentation that requires time-consuming and expensive processing, teams can also use multimedia-based, interactive tutorials to make knowledge transfer programs more entertaining and appealing. However, no software initiative can succeed without the backing of a formal process for managing and transferring knowledge. Hence, training must be embedded as a part of the engineering department's culture and mission (Cancialosi 2017), and the chief information officer should be held responsible for the training's success. By using Stack Overflow to create a knowledge repository, companies can transfer knowledge more effectively, decrease the amount of time invested in training, lower the number of software bugs, and complete projects in less time.

1.5 Research Objectives

The main objective of this research is the prediction of what constitutes an optimal answer in Stack Overflow. By considering the askers' and community's feedback, along with the answerers' reputations and other metadata from a thread's post, the proposed research product is intended to forecast whether an answer is optimal. The research questions posed relate to the average scores of different answer types and to the variability in answerers' reputations and geographical locations in determining how accurate an answer is. Other factors considered were the number of views, word count, relevance, and user reviews.

The best or optimal answer is defined as one that is selected by both the community and the asker independently. Indeed, only a small percentage of solutions are optimal. For example, if a given question received nine proposed solutions, the one with the highest score is the best answer according to the community. However unlikely, there might be multiple answers with the same score; nonetheless, only a single solution14 can be selected by the asker. To clarify, a working solution might not be the best one, as it could contain performance issues or might have missed special scenarios, such as boundary values or states.

14 It may contain multiple steps or stages, or different approaches for solving the same problem.

Both askers and the community can ignore answers or accept and rank solutions.

As depicted by Figure 1-3, an optimal community answer and an asker’s preferred answer need not be the same. Unanswered questions are not part of Figure 1-3, which simply intends to represent the optimal answer region in Stack Overflow. The optimal region is hereby defined as the overlap between the best answer according to the community (highest score) and the best answer according to the asker (personal preference).

[Figure 1-3 is a Venn diagram: within the set of all solutions, the optimal community answer (highest score) and the optimal asker answer (personal preference) overlap in the optimal region.]

Figure 1-3. Stack Overflow and optimal answer region.
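Continuing the illustrative sketch from Section 1.1 (hypothetical names, not an actual API), membership in the optimal region reduces to a single check: the asker's accepted answer must also be the community's highest-scoring one.

```python
# Continues the hypothetical Question/Answer sketch from Section 1.1.
def is_optimal(question) -> bool:
    best = question.community_best()  # community's choice (highest score)
    return best is not None and question.accepted is best
```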


As a collateral objective, the present study seeks to optimize the way IT teams handle knowledge transfer processes via Stack Overflow. The proposed research product outlines a framework to predict high-scoring, relevant answers in Stack Overflow.

Achieving such an objective decreases the number of unused documents in internal knowledge bases and does away with long, useless manuals.

1.6 Research Questions and Hypotheses

This research answers instrumental research questions, as follows:

RQ1: What is the average score of accepted answers versus rejected ones? (comparative)

RQ2: What is the standard deviation of the authors' (source) reputations? (descriptive)

RQ3: How frequently do question tags appear in the body of a chosen answer? (descriptive)

RQ4: What is the average number of comments per question and per answer? (comparative)

RQ5: What is the range in the number of suggested edits per chosen answer? (descriptive)

The research hypotheses are tested via a methodology applied to parameters of Stack Exchange usage, as follows:


RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (has fewer than 2,000 characters), then the answer will have a high relevance score.

RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the answer will more likely be selected as the asker's preferred answer, thus becoming easier to find.

RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow's high-scoring topics will be easier to find (search optimization).

The first research hypothesis will analyze community-driven, high-scoring answers to identify the factors driving relevance scores. For instance, it is hypothesized that posts with more than six comments will likely have a high score. The population's average number of comments is three; hence, answers containing at least twice that number of comments (i.e., more than six) will meet the specified criterion. Additionally, high-quality posts will have fewer recommended edits than other posts within the thread and contain fewer than 2,000 characters.

Similarly, the second research hypothesis identifies the factors leading askers to select a given answer as their preferred one. The average cumulative tag score and bag-of-words score are 5,054,237 and 50, respectively. Therefore, scores higher than 10,108,474 and 100, respectively, would be accepted.
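To make these screening criteria concrete, the following Python sketch encodes the RH1 and RH2 thresholds stated above, plus the high-relevance cutoff used for RH3. The feature names are hypothetical, and the cutoffs are simply twice the reported averages (or, for relevance, the score-greater-than-five rule).

```python
# Illustrative encoding of the RH1/RH2/RH3 screening criteria. Field names
# are hypothetical; thresholds come from the averages reported in Section 1.6.
from dataclasses import dataclass

@dataclass
class AnswerFeatures:
    comment_count: int        # comments on the parent question
    suggested_edits: int      # suggested edits on this answer
    min_edits_in_thread: int  # fewest suggested edits among sibling answers
    char_count: int           # length of the answer body
    cumulative_tag_score: int
    bow_score: int            # bag-of-words frequency score
    score: int                # answer score (upvotes minus downvotes)

def meets_rh1(a: AnswerFeatures) -> bool:
    # RH1: more than six comments (twice the average of three), the fewest
    # suggested edits in the thread, and fewer than 2,000 characters.
    return (a.comment_count > 6
            and a.suggested_edits <= a.min_edits_in_thread
            and a.char_count < 2000)

def meets_rh2(a: AnswerFeatures) -> bool:
    # RH2: cumulative tag score above 10,108,474 (twice the average of
    # 5,054,237) and bag-of-words score above 100 (twice the average of 50).
    return a.cumulative_tag_score > 10_108_474 and a.bow_score > 100

def high_relevance(a: AnswerFeatures) -> bool:
    # With an average answer score of two, any score greater than five
    # is treated as a high relevance score (used in RH3).
    return a.score > 5
```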


The third hypothesis will utilize the factors identified in hypotheses 1 and 2 to build a conceptual framework and locate Stack Overflow answers faster. As mentioned previously, answers have an average score of two. Any answer with a score greater than five will be considered to have a high relevance score.

1.7 Scope of Research

The scope of the present research is twofold. First, it defines and manually tests a research product that could allow future researchers to understand the factors leading to high-scoring answers in Stack Overflow. Second, it recommends the implementation of the proposed research product in a tool (e.g., Python or C-sharp) supporting text analytics and statistical analysis. In this way, thousands of posts can be analyzed in a matter of seconds.

1.8 Research Limitations

This research is limited in several ways. First, the data for the praxis were almost entirely restricted to the year 2018. Stack Overflow's maturity exceeds eleven years,15 and the quantity and quality of posts from the years following its inception might be vastly different from the ones used here. Using a dataset from a different time period would result in significantly different outcomes, considering that Stack Overflow was not as popular and/or influential in 2008 as it is today.

15 Stack Overflow was founded in 2008 by Jeff Atwood and Joel Spolsky.


Second, since Stack Overflow is a highly technical forum that does not favor social interactions, this praxis's conclusions will not be applicable to online forums at large, nor to junior or retired programmers and non-college graduates. Having the goal of getting better, more reliable results, the site expects focused answers to problems that prevent developers from doing their jobs.16 Similarly, strict rules are enforced with a fair level of unspoken tension, making mid- and senior-level developers—and not beginners—Stack Overflow's target audience. A related limitation is users' bias and fallibility, as individual askers or even the community can select a suboptimal answer based on the way it is written (e.g., it is entertaining or easy to understand).

16 Also known as an impediment or blockage in the Agile/Scrum methodology.

Third, the present praxis develops a framework that Stack Overflow’s users or administrators/moderators would use; however, it does not include specific implementation details of the software that would automate the process described in the framework.

1.9 Organization of Praxis

The current chapter introduces knowledge transfer and its limitations. It also explores Stack Overflow's main features and intended audience. The second chapter offers a literature review covering knowledge transfer, defining the concept of information, describing online forums, and analyzing how methods used in related research informed the present praxis. The third chapter explores the methodological approach used—which involves text analytics, partial least squares (PLS), and binary logistic regression (BLR)—and explains how the data were collected and sanitized. Chapter four summarizes the results of the research, and chapter five enumerates the conclusions, contributions, and recommendations of the praxis. Sources and appendices are included at the end of the paper.


Chapter 2: Literature Review

2.1 Introduction

The definitions of information, knowledge, and knowledge absorption and transfer have been developed extensively in the literature. Such concepts are pivotal to understanding online forums and Q&A communities, including those catering to technical users. Stack Overflow's popularity is heavily dependent on information flows and trust between users.

Additionally, the platform's user experience (UX) relies not only on availability and usability but also on the successful implementation of a gamification17 paradigm. The use of reputational scores, badges, and bounties drives answerers to compete against each other. A high reputation unlocks new privileges and grants access to moderation tools; badges are considered special achievements and are issued at three levels: bronze, silver, and gold; and bounties tend to redirect traffic to questions with high visibility, making them look like missions.18 Therefore, the user offering the highest-scoring answer is considered to be the thread's winner, thereby garnering an elevated standing within the community. Stack Overflow's Leagues display a scoreboard with the highest-ranked users by week, month, quarter, year, and all time.19

17 Gamification is the use of elements and techniques usually found in video games to make a product more appealing and engaging.

18 Sources: https://StackOverflow.com/tour and https://www.YouTube.com/channel/UC2hxYQtGLEkcOMK4h8JRycA/videos.

19 Source: https://StackExchange.com/leagues/1/alltime/stackoverflow. The highest-ranking Stack Overflow user is Jon Skeet, a Java and C-sharp software developer for Google, considered an authority on date and time algorithms. He has become almost a mythical figure within the Stack Overflow community, as all of his [almost 40,000] answers are upvoted and over 200 million visitors have seen his solutions, saving an estimated 56.8 million hours of development time. He was the first person to win over one million reputation points in Stack Overflow.

No research can be successful in determining practical ways the forum can be improved without a full understanding of the type of information Stack Overflow users expect and the format that it follows. Moreover, it is important to assert that Stack Overflow's true contribution is its ability to offer answers to complex coding questions and contribute to the success of programming projects. Software developers see the benefit in data availability but value data readiness even more.

2.2 Information, Knowledge, and Related Concepts

Information can be defined as a set of organized data patterns, but knowledge is intrinsically more permanent, a public good, non-excludable, multilayered, and often indivisible (Bock and Kim 2002; DeMeyer 1991; Roberts 2000). Information is made up of facts and symbols, whereas knowledge involves skills and expertise; hence, knowledge is interpreted, personalized, or contextualized information (Zhang 2007).

Knowledge modes are multi-dimensional and are defined as tacit, embodied,20 encoded,21 embrained,22 procedural,23 and embedded24 (Joshi, Sarker and Sarker 2007). Furthermore, knowledge processes involve the discovery, capture, sharing, and application of information (Alawneh 2016) for improving the success of IT projects, online collaboration, and the mitigation of risks.

20 Partially articulated.

21 Stored in data banks.

22 Ability to interpret underlying phenomenological patterns.

23 Process-based understanding.

24 Contextual, not pre-given.

Tacit25 knowledge represents an intuitive understanding grounded in expertise that is not only harder to articulate than explicit, formal, systematic, codified, or structured knowledge but also cannot be stored in relational databases. Tacit knowledge can be divided further into collective and overlapping/specific knowledge. Tacit knowledge, as opposed to articulated knowledge, is informal. It is shared via person-to-person exchanges and spreads locally and across networks of people interested in similar topics26 (see Figure 2-127; Jacobs 2018). Indeed, Stack Overflow relies on the ability of users to translate implicit knowledge into explicit knowledge by allowing answerers to articulate solutions that can be stored and later evaluated by the community.

25 Unstructured and actionable knowledge are subsets of tacit knowledge.

26 The interest graph is a digital representation of a given user's interests. Stack Overflow, as opposed to a social networking site, connects people via interest-links.

27 Source: https://Upload.WikiMedia.org/wikipedia/commons/2/23/Interest_graph_vs_social_graph.png. Labeled for reuse.

Figure 2-1. Interest graph.

Knowledge diffusion relates to knowledge shareability and is positively correlated with codifiability. For instance, confidential information tends to be undiffused. Knowledge diversity, however, is uncorrelated with shareable knowledge (Slaughter and Kirsch 2006). Even when some tacit knowledge cannot be codified and becomes ambiguous, it can still be shared both internally and externally through actions and practices (Ihrig and MacMillan 2015; Szulanski, Ringov and Jensen 2016). Other categories of knowledge are know-why,28 know-how,29 know-who,30 and know-what31 (Roberts 2000; Santhanam, Seligman and Kang 2007).

28 Causal and contextual.

29 Procedural.

30 Selective social relations.

31 Declarative and factual.


Information systems projects are "collective, collaborative, complex, knowledge-intensive, and creative efforts" (Alawneh 2016, 1). Information technology (IT) is a subset of information systems (IS). Even though "one of the most important factors contributing to [IT] success is communication and documentation" (DeSimone et al. 1995, 17), one can argue that knowledge repositories can grow so quickly that they render themselves obsolete. Having a predictive framework that retrieves a good answer in a timely manner could be extremely useful to Stack Overflow's effectiveness and outreach.

2.3 Digging into Stack Overflow

Developers are usually redirected to Stack Overflow from Google's top search results, given that Stack Overflow is regarded as a "code-centric knowledge base" (Yang 2016, 225). Stack Overflow also facilitates the process of sharing experiences and expertise in the area of computer engineering.

Stack Overflow is a crowdsourcing platform for knowledge transfer among software developers. Programmers use Stack Overflow as a means to gain knowledge specifically related to “programming languages, API use, configuration management, web frameworks, and web browsers” (Abdalkareem 2017, 55); it allows the community to document bugs and ongoing issues. Some recommendations to improve effectiveness within the site are to “provide direct feedback, assess code snippet quality, and link changes to discussions” (Abdalkareem 2017, 57). Oliveira (2018) proposed connecting users as part of the community and promoting bond-based relationships.


Vasilescu (2013) claimed participation in Stack Overflow does not interrupt developers' working rhythm; nor does it slow productivity. Highly productive GitHub committers tend to adopt a more professorial attitude toward Stack Overflow's forums and transfer their knowledge in a more egalitarian way via micro-, intermediate-, and macro-analysis. Vasilescu's paper informed this research in terms of linking Stack Overflow to other platforms, but it did not address the quality of answers posted or the relation of this quality to effective knowledge transfer.

Previous research has been conducted in Stack Overflow to improve the maintainability of the site and optimize its responsiveness. Zhang et al. (2015) outlined how duplicate questions make site maintenance harder, and their solution—DupPredictor32—outperformed Stack Overflow's search engine by 40.63% (Zhang et al. 2015).

Chua (2015) used metadata33 and content34 for a predictive framework to determine how a high number of downvotes, a low number of tags, and a small number of characters within a question would increase the question’s likelihood of being answered. Yang (2016) analyzed code snippets in Python, JavaScript, Java, and C-sharp to determine answers’ correctness and completeness.

32 DupPredictor uses Latent Dirichlet Allocation (LDA) to extract topic distributions; Porter stemming to reduce words to their root form; and WVTool to remove stop words.

33 Popularity, participation, asking time, derived role, and derived popularity.

34 Level of detail, specificity, accuracy, clarity, and socio-emotional value.
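The preprocessing steps named in footnote 32 (stop-word removal and Porter stemming) are standard text-mining operations. A minimal sketch using Python's NLTK, one of several possible toolkits (DupPredictor itself used WVTool for stop words), might look like the following:

```python
# Minimal sketch of the preprocessing steps named in footnote 32:
# stop-word removal and Porter stemming (here via NLTK rather than WVTool).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def preprocess(text: str) -> list:
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

print(preprocess("How do I parse dates in Python?"))
# e.g., ['pars', 'date', 'python']
```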


Oliveira (2018) has argued that Stack Overflow exhibits a participation imbalance between United States users (individualistic, contributors, active) and Asian users (collectivist, lurkers, passive) via a value-sensitive-design (VSD), tripartite methodology. Additionally, Oliveira (2018) has asserted that the values (such as productivity and reputation) showcased in Stack Overflow better align with a Western, individualistic perspective. Stack Overflow can be considered a structural resource, as the site and its users establish a dynamic feedback loop.

Zhang (2018) proposed a technique named RASH that, via lexical and historical analysis of application programming interfaces (APIs), maps APIs with 70% accuracy and accelerates the resolution of questions and the saving of developer’s time. This is relevant because, when compared to other question types, API-related questions take three days longer for answers yet are viewed 200% more, according to Zhang (2018).

Ninety-two percent of Stack Overflow's questions are answered, with a wait time averaging 11 minutes for the first answer and 24 days to get a solution or accepted answer (Masudur-Rahman 2018). Masudur-Rahman (2018) created a prediction model to detect the best answer (defined as the most upvoted) within an unanswered question by using five metrics (answer rejection rate, last access delay, topic entropy, reputation, and vote), to which he applied four different methods: lexical (readability), semantic (topic similarity), user behavior, and popularity. However, Masudur-Rahman (2018) did not apply his findings to already-answered questions.


2.4 Knowledge Transfer

Knowledge transfer means getting the right information to the right people at the right time (Cancialosi 2017). There are three epistemological stances considered for knowledge transfer: cognitivistic,35 connectionistic,36 and autopoietic37 (Joshi, Sarker and Sarker 2007).

35 Fixed, stored data.

36 Knowledge transfer resembles interconnected networks.

37 Self-produced, autonomous, co-evolved, unified, and unshareable.

Indeed, knowledge transfer has been a cornerstone of social and technological advances throughout history. For instance, the printing press was a major improvement over oral tradition for transferring knowledge. Some classical works, such as The Aeneid,38 The New Testament,39 and The Divine Comedy,40 garnered more popularity simply by having the right language channel. By selecting the right mode of communication, knowledge transfer becomes easier to manage and ultimately more effective.

38 Written in Latin.

39 Written in Greek.

40 Written in Italian.

Nonetheless, technology products are different from works of art, as they are often the outcome of substantial embedded/internalized tacit knowledge. The present praxis argues against the widely accepted idea that mostly tacit (as opposed to explicit) knowledge creates a strategic advantage by defying the economic law of scarcity. Also, it challenges the perception that a technology's popularity makes it more accessible or easier to understand.

In Laudon and Laudon (2006), Procter & Gamble (P&G) solves the problem of its annual growth rate slowdown from 5% to 2.6% via knowledge management-driven initiatives and information systems. The case study describes how P&G implemented InnovationNet for bolstering collaboration, enabling better surveys, removing silos, implementing agent-based modeling and simulations, and triggering synergy across teams. The authors observed that software like MatrixOne was used to automate and standardize knowledge components.

Most large companies understand the difficulty of transferring knowledge from experts to novices. Employees may feel overwhelmed by the amount of information they are expected to master and discouraged by the lack of formal processes to guide them. Changchit (2003) proposed an intelligent and interactive system that measures perceived usefulness and ease of use to achieve such a goal. Besides, good knowledge transfer practice and unconscious learning of managerial systems, norms, and values take employees to an expert level via pattern recognition (Swap et al. 2001). Nonetheless, formal knowledge transfer processes can stifle disruptive innovation models.

Some solutions have been proposed regarding knowledge transfer optimization; however, none of them deal with the issue of documentation removal, peer review, or updating. Alawneh (2016) proposed a framework for dealing with experts' diversity when preparing a memory repository. Geiger (1994) suggested an evaluation technique for tacit knowledge. Dillon, Graham, and Aidells (1972) analyzed delivery methods41 for both individuals and groups. DeMeyer (1991), Roberts (2000), and Sarker et al. (2005) elaborated on the ability of information and communication technologies (ICTs) to enhance the transferability of knowledge, even though they acknowledged the relevance of face-to-face interactions vis-à-vis relational proximity or localness of knowledge.

Iyengar, Sweeney, and Montealegre (2015) used a MIMIC (multiple indicators and multiple causes) model, which applies statistical validation techniques to a set of formative indicators measuring internal IT use, knowledge transfer effectiveness, absorptive capacity, and financial growth. Slaughter and Kirsch (2006) analyzed the impact of knowledge transfer on software process improvement (SPI) initiatives in terms of composition (types of mechanisms) and intensity (how frequently mechanisms are used), converging into accumulative techniques.

Cancialosi (2017) recommended six steps42 for achieving effective knowledge transfer. Guo (2016) developed a joint learning framework that combines transfer43 and active44 learning, considering the different source and target domains for sample labeling of seed parameters, cross-class knowledge transfer, and statistical classifiers. Zero-shot learning uses models without labeled data45 (Guo 2016). Rohrback (2011) pointed out that traditional multi-class classification and error-prone zero-shot learning methods tend to use a limited number of object classes.

41 Video and practice.

42 Make it formal (1), create duplication (2), train, train, train (3), use systems (4), create opportunities (5), and be smart when using consultants (6).

43 System-driven.

44 Human-driven.

45 Unlabeled data can lead to information loss.

Akshatha (2017) studied knowledge transfer's impact on job satisfaction and employee turnover. However, females, top management, and employees with five or more years of experience in their roles are underrepresented in the study. According to the United States Department of Labor, workers between the ages of 55 and 64 have three times the tenure46 and job security of workers between the ages of 25 and 34 (Bureau of Labor Statistics, United States Department of Labor 2012). Furthermore, the lack of educational background in non-professional jobs tends to influence tenure negatively. Computer and engineering occupations outperform most others in terms of tenure, except for careers with the federal government.

46 Amount of time a person has worked for a given employer.

The following research papers have potential applicability to the field of knowledge management and highlight important techniques for transferring knowledge in online forums. For example, He (2018) and Nowlan (2007) used intelligent agents47 for faster decision-making via collaborative filtering48 as a solution to data sparsity.49 Wang (2010) explored semantic-oriented knowledge transfer for review rating.50 Shahbandi (2018) optimized error-prone mapping algorithms in the field of robotics. Jiang (2018) recommended a novel algorithm for auxiliary textual datasets called transfer spectral clustering (TSC). The research outlined above deals with the optimization of text-based knowledge, and a few of the techniques presented, such as collaborative filtering, were partially used for this paper's data analysis.

47 Human proxies containing a brain, body, society, and human-agent interaction.

48 Collaborative filtering can be narrow or general, being a technique applied by recommender systems.

49 This concept is very common in natural language processing, and it relates to the problem of not observing sufficient corpus data to accurately model a given language. It is also known as data sparseness or paucity.

50 Transfer rating.

On top of optimization, one can see how trust, reputation, and culture all have a considerable influence on knowledge management. "Knowledge tends to move horizontally and vertically along structural lines" (Slaughter and Kirsch 2006, 305), and it drives IS profitability as knowledge resources become scarce and hard to purchase (Zhang 2007). As recommended by Gist (1989) three decades ago, behavioral modeling51 can be beneficial for knowledge transfer.

51 Process demonstrating the behaviors driving performance and efficacy.

Simon et al. (1996) compared traditional and non-traditional computer training techniques in a military setting and considered trainees' cognitive abilities, motivations, and training environment. For Simon et al. (1996), knowledge was a process and not a product. Amiryany (2014) researched knowledge-based acquisition failures, especially within high-tech companies, dividing them into formal acquisition structures, communication tools and practices, and on-the-job learning activities. Chang (2012) examined the positive impact52 that IT-related software outsourcing has on technical knowledge (Chang and Gurbaxani 2012), economies of specialization, and accumulated knowledge, correlating transfer with intensity, leverage, size, and consistency (Chang and Gurbaxani 2012).

52 Up to 6% gains.

Nevertheless, companies usually seek outsourcing solutions due to scarce resident capabilities; hence, expected benefits might be overestimated, and tacit/implicit knowledge transfer could be nonexistent. Knowledge transfer improves due to the external nature of the provider, and wage benefits—not knowledge transfer optimization—drive business decisions. More importantly, companies often do not consider online forums, such as Stack Overflow, to manage their knowledge repositories due to security and legal concerns. Stack Overflow can indeed manage private knowledge bases for private and public companies alike.

A segment of the research literature is oriented toward profitability. Ihrig and MacMillan (2015) introduced a two-dimensional map going from tacit to explicit53 and from proprietary to widespread.54 The goal of this Cartesian model was to shift data points toward the top-right corner, with the primary goals being to map knowledge assets,55 assemble multi-functional teams within diverse unit levels, identify new opportunities—such as licensing—and contextualize and re-discover knowledge for new applications.

53 Y axis: going from unstructured to structured.

54 X axis: going from undiffused to diffused.

55 Knowledge assets can be hard/technical or soft/managerial.

One of the main barriers to knowledge transfer adoption relates to non-codified data. It is estimated that 90% of all stored data is unstructured. Furthermore, with the ubiquity of digital communication and the prevalence of an aging workforce (Benabou and Benabo 1999) comes an increase in the number of knowledge repositories, the maintenance of which becomes harder due to "data explosion and information overload" (Delen and Al-Hawamdeh 2009). Far from resolving the problem, new technologies—data mining, data warehousing, web crawling, and cheap storage—have made it more pervasive. For instance, Koman and Kundrikova (2016) claimed that technological development triggers constant information production and pointed out that one-fourth of managers are confused by the concept of big data. Good knowledge management—core competencies, areas of expertise, intellectual property, and deep pools of talent (Ihrig and MacMillan 2015; Khandelwal and Gottschalk 2003)—must avoid the big data trap.

There are several proposed solutions in the literature, but only a fraction of them deal with the issue of massive knowledge repositories. Moreover, there are even fewer solutions that take into account the value of online repositories. As pointed out by Simon (1971) almost 50 years ago, "A wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it" (Simon 1971, 40-41).

Delen and Al-Hawamdeh (2009) developed a holistic knowledge management framework within a federated system that addresses a lack of emphasis on textual information. Their methodology was to harvest tacit knowledge via interview sessions, mentoring, [memorable, truthful, and positive] story-telling, analogies, and metaphors56 (Swap et al. 2001). The researchers' sequential, three-step process included extracting data from information sources, consulting know-how materials, and finally generating actionable knowledge. The drawback of such an approach is that knowledge transfer is more efficient within an iterative, network-based, and collaborative environment with feedback effects, not within a step-by-step recipe.

56 Personalization, socialization, and internalization strategies.

Kim, Song, and Jones (2011) explored a social-cognitive selection framework for a knowledge acquisition strategy in virtual communities via goal-setting theories from a demand-and-supply perspective. This research has provided abundant information on acquisition methods within virtual communities; it has not, however, expanded on inter-group knowledge transfer.

Finally, the knowledge engineering paradox states, "[A] more capable individual is unable to successfully explicate his/her knowledge in forms that can be internalized by other less capable team members" (Joshi, Sarker and Sarker 2007, 331), especially as the knowledge gap increases. In fact, high performers may be detrimental to departmental knowledge transfer goals. The reason, according to DeSimone et al. (1995), is that expert innovators work in spurts and reach their highest levels of productivity only rarely. Here, one encounters a conundrum: Content quality decreases as more people contribute to the knowledge base (further discussed in the following sections).

2.5 Online Forums

Even though they are hardly mentioned in developer meetings and are infrequently referenced within internal documentation, online forums are vital to software developers' successes. Q&A sites and technical forums depend on content quality and communal trust. For instance, Attiaoui (2017) used the belief measure of expertise (BME) for the detection of authoritative users; authoritative users are a major drawback within online communities. Specifically, any Stack Overflow user with a reputation greater than 2,400 points is regarded as an expert, without considering the number of accepted answers the user has contributed to the site.

Szulanski, Ringov, and Jensen (2016) claimed that judicious knowledge transfer timing contributes to increased delivery accuracy, regardless of whether a front-loading57 or back-loading58 mode is used. The researchers asserted that knowledge transfer becomes sticky when it is noteworthy. Although foundational to domain knowledge transfer, Szulanski's research mostly ignores the role of online communities.

57 More affordance for tacit knowledge exchange is allocated to the initiation of the transfer than to its implementation.

58 Less affordance for tacit knowledge exchange is allocated to the initiation of the transfer than to its implementation.

Santhanam, Seligman, and Kang (2007) researched knowledge transfer processes and routine work elements during the post-implementation stage. Additionally, they evaluated network effects with end users and colleagues within and between groups. As they pointed out, transactive memory59 can be detrimental to proper knowledge transfer.

59 When person A uses person B as a memory aid.

Joshi, Sarker, and Sarker (2007) explored the impact of knowledge within development teams that support information systems. They found capabilities that do not tend to play a significant role but seek a degree of resonance between senders and receivers. The authors also explored communication theory in terms of messages, senders, receivers, channels, transmissions, and communication effects. Even though the article offers advice in the area of knowledge transfer, it mostly ignores the importance of online forums for IT teams.

According to Sarker et al. (2005, 214), "Not much research has been conducted to examine knowledge transfer within groups in virtual communities, especially those that span time and space (i.e., virtual teams)." However, Sarker et al.'s research could benefit from validation through the use of a comprehensive dataset.

A major problem with online forums is the failure to attract quality answers. Hence, maximizing answerability and answer quality rates would be beneficial for both the platform and the community of users it hosts. By using natural language processing (NLP) for measuring users' engagement, Kowalik (2016) found that elderly60 users of Stack Overflow tend to have higher reputational scores; similarly, seniors61 answer slightly more often than juniors but ask half as many questions.

60 In terms of age and not technical expertise.

61 In terms of technical expertise and not age.

2.6 Summary and Conclusions

After exploring the concepts of information and knowledge, one appreciates their relevance to the area of online forums and specifically to the success of Stack Overflow.

Stack Overflow’s main goal is to transfer knowledge between software developers effectively and improve upon existing techniques used in Q&A sites by applying

60 In terms of age and not technical expertise.

61 In terms of technical expertise and not age.


gamification and other related techniques, such as token economy, for behavior

reinforcement.

Therefore, it is paramount to determine which factors are significant in Stack

Overflow’s posts. Identifying such factors would optimize the site’s responsiveness and

decrease the amount of time users spend finding an answer.

The domain of knowledge management is filled with challenges. Those were considered while reviewing the literature, and some of the shortcomings found are listed below:

1. Who decides who the expert (knowledge originator) is? Most companies look at

data and disregard sources.62 Sources are critical to knowledge acquisition and

can be divided into three categories: dyadic,63 published,64 and grouped65 (Kim,

Song and Jones 2011; Sarker et al. 2005).

2. Some managers adopt big data as a panacea, ignoring that sometimes “less is

more.”

3. Close-mindedness, biases, and overconfidence are underestimated when it comes

to knowledge management.

62 Static or dynamic.

63 One-to-one dialogue between the knowledge recipient and the provider via direct communication.

64 Many-to-many relationships between knowledge providers and recipients that can be helpful to noise-, fidelity-, and credibility-related factors.

65 Open-venue exchange of knowledge among multiple recipients and sources.


4. Users ignore that, besides people, artifacts and organizational entities (such as

Stack Overflow) can also transfer knowledge (Alawneh 2016).

5. Knowledge transfer represents more than a back-up plan kicking in when critical

members of the team leave (Cancialosi 2017); it prevents members of the team

from leaving. In other words, an expert's letter of resignation should never kick off knowledge transfer.

6. There is a lack of video recordings (DeMeyer 1991; Zhang 2007) of code reviews

and lunch-and-learn sessions.

7. Learning materials tend to be long and convoluted (Jacobs 2018).

8. Prevalence of medium dysphoria, which occurs when users become confused while learning from similar materials stored in different formats.

9. Knowledge repositories become a disorganized dumping ground and contribute to

organizational waste (Jacobs 2018).

10. Peer-review is commonplace in academia but barely applied to IT departments’

knowledge bases.

11. Competence/reputational trust66 is often regarded as more important than

benevolence/emotional trust (Khandelwal and Gottschalk 2003; Ko 2010; Roberts

2000). Also, managers fail to take advantage of consultants’ expertise and Q&A

forums. Reputation tends to serve as an information filter due to the massive

amount of available information (Sarker et al. 2005).

66 Formal process.


12. Reverse and spontaneous mentoring67 have an impact on organizational

knowledge, work effectiveness, and job success/mobility (Geiger 1994; Quick

MBA 2010; Scandura 1992). Starcevich (1999) recommends a “power-free, two-

way, mutually beneficial relationship” between mentors and protégés.

13. Few companies conduct knowledge audits or appoint a CKO (Chief Knowledge

Officer), Vice President of Knowledge Management, CLO (Chief Learning

Officer), CICO (Chief Intellectual Capital Officer), CINO (Chief Innovation

Officer), or CPO (Chief People Officer) (Babcock 2004).

14. Unwillingness to change, archaic corporate politics, and irrelevance affect
knowledge-based programs (Babcock 2004). Hence, knowledge management

should be a company-wide problem, not an IT problem.

15. Only considering materials from the same field and industry, usually exclusively

available in English.

The behavior- and culture-driven solutions found in the literature are extensive but do not cover the problem of excessive data availability. DeSimone et al. (1995), Khandelwal and Gottschalk (2003), and Zhang (2007) describe the importance of incentives and reward systems, attitudes toward risk and reward, and hiring and training initiatives, on top of informal, as-needed, spontaneous, and circumstantial (Benabou and Benabo 1999) processes for creating a thriving organizational culture that encourages knowledge transfer. Nonetheless, Bock

67 Protégé-to-mentor.


and Kim (2002) claim that incentives, as opposed to motivations, are counterproductive.

Stack Overflow agrees with this view.

IT developers should never have to seek advice from the nicest or most experienced teammate but from the most knowledgeable one. As it turns out, "Bigger is

[not] more important than better” (Kalpic 2008, 4). It is not information volume (Sarker

et al. 2005) that matters but diversity, veracity, sparsity, and velocity (Ihrig and

MacMillan 2015; Koman and Kundrikova 2016).

The Stack Overflow platform comes with several embedded problems. For example, Ragkhitwetsagul (2017) surveyed registered Stack Overflow users and visitors68 to evaluate outdated code, its legal implications, and its detrimental effects when it introduces vulnerabilities. Unfortunately, most programming languages have a short lifecycle, causing many good, reliable articles to become deprecated within a matter of months.

Also, virtually no answerers include software licenses in their code snippets. Sixty-nine percent never validate against licensing69 conflicts. Nine percent of developers copy code on a daily basis. Sixty-four percent actively reuse code from the site and find problems with it but never report it, and nine percent experience legal issues.

In the following chapter, this research will explore novel techniques for collecting data and finding optimal answers within Stack Overflow, at the same time analyzing

68 Unregistered users.

69 In accordance with the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) document.

significant factors that contribute to more efficient knowledge transfer among technical users.


Chapter 3: Methodology

3.1 Introduction

This chapter categorizes Stack Overflow as a data repository and displays the

praxis’ outline in a flowchart that includes the study’s methodology. The chapter goes on

to describe how the data70 was collected and sanitized—a vital step before conducting

text analytics. The last section introduces the uses of text analytics and multiple

regression analysis, specifically partial least squares (PLS) and binary logistic regression

(BLR), as methodological approaches in the praxis. The predictive model will help to

find optimal answers within Stack Overflow and, in so doing, identify significant factors

contributing to more efficient knowledge transfer among software developers.

Delen and Al-Hawamdeh (2009) described a method for posting an unstructured

question to a generic system in Figure 3-1.71 The steps described are adopted by Stack

Overflow almost in their entirety. Table 3-1 relates Delen’s holistic framework to Stack

Overflow’s behavior. The table serves as a recapitulation of the definitions categorizing

Stack Overflow as a knowledge repository.

70 All referenced data was available in the Stack Exchange database and could be accessed via SQL queries.

71 Source: Dursun Delen, Suliman Al-Hawamdeh. Workflow for posing a query or an unstructured question to the system in A Holistic Framework for Knowledge Discovery and Management (Communications of the ACM 52 (6), 2009), 144. Used with permission.


Figure 3-1. Delen’s holistic framework for knowledge management discovery.


Table 3-1. Mapping Delen’s Holistic Framework into Stack Overflow

Delen's Holistic Framework | Stack Overflow Equivalent
Start interaction | User accesses Stack Overflow's website via a web browser or mobile app. Authentication is optional.
Submit query | Users try to find an answer to a question.
Search knowledge repository | The Stack Overflow search engine looks for a valid answer.
Answer found / satisfaction level | If an answer is found and it is satisfactory, the user ends the interaction (best-case scenario). If the answer is found, but it is not satisfactory, the user creates, refines, and submits a new query or question.
Answer not found / satisfaction level | If there is no answer for a posted question, other members of the community step in to offer a resolution. If at least one proposed solution is valid, the user interaction ends; otherwise, the process restarts.
Create knowledge nugget for KD | Each question, answer, comment, or suggested edit becomes a post.
Knowledge Depository (KD) | All posts are stored in the Stack Exchange Data Explorer.

Note. A few steps from Delen’s framework are ignored by Stack Overflow, such as identify human experts, consult with human experts, identify other human experts, and direct user to specialized discussion boards. Indeed, one goal of Stack Overflow is to keep discussions focused and concise and to eliminate unnecessary steps that could delay askers from getting their questions answered. Source: see footnote 71.


Supportability analysis72 (see Figure 3-2), a technique commonly applied to

logistics management to bolster efficiency, was applied in the planning of this paper. The

process is a sequence of eight steps. After evaluating the research requirements (step 1)

and determining the problem and thesis statements (step 2) in accordance with current

research (step 3), SEDE data was retrieved and sanitized (step 4). Upon selecting two

statistical methods (step 5), a new conceptual framework was built (step 6) and tested

(step 7). The analysis finishes with conclusions and recommendations (step 8), urging

other researchers to implement the proposed framework following standard system development procedures.

72 Source: Military Standard 1388-1.


Figure 3-2. Supportability analysis.

3.2 Data Collection and Analysis

The data source for this research is the Stack Exchange Data Explorer (SEDE) repository, which contains a comprehensive data dump of Stack Overflow’s posts. The


SEDE73 was updated in September 2019 and is publicly accessible. It stores almost 400 files and 60 gigabytes of data. The repository contains questions, answers, comments, tags, and user information from Stack Overflow.

The Stack Exchange API is an online, freely available platform with read-only rights that allow researchers to query the Stack Overflow data repository and retrieve near-real-time information about the site's activities. Results are limited to 50,000 records per query, probably to prevent denial of service (DoS) attacks and to guarantee the SEDE's responsiveness.

With thousands of posts entered into Stack Overflow daily, data collection and cleaning was a challenge in this study. Several steps were needed to ensure the sample data was representative of the population. First, posts were categorized via a stratified74 and systematic75 [without replacement] random76 approach. Second, case studies were

used to validate the reliability and accuracy of the results. Such a testing methodology

73 Sources: https://Meta.StackExchange.com/questions/2677/database-schema-documentation-for-the- public-data-dump-and-sede and https://Data.StackExchange.com/stackoverflow/query/835150/list-all- fields-in-all-tables-on-sede.

74 This is a statistical method that breaks the population into subpopulations and then into samples from each subpopulation. It introduces certain risks. The criteria for determining subpopulations might be biased, and convenience sampling might result. Convenience sampling is a non-randomized method that simply looks at the data that is most easily available; e.g., the first ten objects of a set.

75 This is a statistical method that selects samples from a population following ordered framing. For example, in a nine-person population where each person is assigned a unique, sequential identifier, a systematic sampling would be selecting elements three, six, and nine.

76 Random sampling ensures that all elements in the sample are selected by chance, considering that each element had the same probability of being chosen.


established benchmarks to corroborate the accuracy of the proposed model’s predictive

power.

The critical areas for data collection and the interrelations within the data are

illustrated in Figure 3-3. The proposed model considers question and answer posts’

metadata, user information, comments, feedback, badges, tags, and votes (see Appendix

A).

Figure 3-3. Relationship between critical data fields collected.


The data was transformed using a variation of the Calefato approach (Calefato

2018). The Calefato approach converts variables with a high variance into numerical categories. For example, the literature observes that users’ reputations in Stack Overflow are highly variable—scores range between 1 and over 1,000,000—and prevent even distribution of information across different topics. Through the use of the Calefato method, new users received a value of 1,77 low reputation users a value of 2,78 established

users a 3,79 and expert/trusted users a 4.80
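As an illustration, this categorization can be expressed in a few lines of Python; the following is a minimal sketch assuming the reputation thresholds defined in footnotes 77 through 80, with a hypothetical function name:

def categorize_reputation(reputation):
    # Map a Stack Overflow reputation score to a Calefato-style category.
    if reputation < 10:
        return 1  # new user
    elif reputation < 1000:
        return 2  # low reputation user
    elif reputation <= 20000:
        return 3  # established user
    return 4      # expert/trusted user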

As part of the data cleaning process, generic questions such as “I keep having

problems” or discussion triggers like “Tell me what your favorite programming language

is” were removed from the dataset. Stack Overflow favors answers that are concrete and

verifiable, disfavoring answers that are vague, open-ended, and conducive to a

fragmented discussion. This does not imply, however, that there is but a single way of

solving concrete problems.

The current praxis uses quantitative methods (i.e., regression analysis) for data

analysis. Additionally, the proposed framework evaluates post quality and determines the

conditions that lead to successful answers— having the highest community score in the

77 Current score < 10. Reputational scores cannot have a negative value as the minimum score will always be one, no matter how many times a user is downvoted. Historical reputation graphs, however, can contain metadata to determine a user’s true score, which can be negative. For simplicity’s sake, this research assumes all reputational scores in Stack Overflow to be positive integers greater than or equal to one.

78 Current score in the range [10, 1,000).

79 Current score in the range [1,000, 20,000].

80 Current score > 20,000.

thread and being selected by the user who posted the question—in Stack Overflow. The results are presented in chapter 4.

3.3 Research Methods

The following sub-tasks are warranted by the need to ensure the validity of the predictive framework that is developed as part of chapter 4:

1. Collect a sample from the SEDE website and ensure that the sample is both

representative and systematic.

2. Compare and contrast answers with a high positive score (upvoted) versus

answers with a negative score (downvoted). This process facilitates the

identification of the factors driving scoring mechanisms within Stack Overflow

and their relationship to high-quality posts.

3. Determine if, as hypothesized, an answerer’s reputation is the driving factor

leading to answer selection. For example, if a question is answered by a member

having a higher-than-average reputation, then the answer has better odds of being

selected.

As previously stated, the following research hypotheses will be evaluated:

RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (has fewer than 2,000 characters), then the answer will have a high relevance score.

RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the


answer will more likely be selected as the asker’s preferred answer, thus becoming easier to find.

RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow’s high-scoring topics will be easier to find (search optimization).

Due to the neutral style of Stack Overflow’s posts, sentiment analysis and affinity tables were not the best methods for evaluating the content of questions, answers, and comments. Another factor that contributed to this choice was the existence of strong vetting, social grooming, and heavy editing against emotionally-leaning posts within the platform.

The research product, depicted below in Figure 3-4, justifies the use of the research methods—NLP, PLS, and BLR.

NLP branches out from artificial intelligence with the goal of deciphering human language in a way that is understandable to computer systems (Garbade 2018). NLP is hard to implement due to the nuances of human communication, permeated by slang and sarcasm. Two families of techniques used in NLP are syntactic and semantic analysis.81 Most NLP conducted in this praxis was applied via the ProWritingAid tool.

PLS will be applied to determine whether an answer has a high score or not. PLS

is based on multivariate regression and principal component analysis (Roy 2015). Even

81 Some syntax methods are: lemmatization, morphological segmentation, word segmentation, part-of- speech tagging, parsing, sentence breaking, and stemming (Garbade 2018). Some semantics-based processes are: named entity recognition (NER), word sense disambiguation, and natural language generation (Garbade 2018).


though this statistical method is often used when more than one dependent, non-binary variable exists, the nature of answer scores within Stack Overflow made it a good fit for the analysis. In fact, PLS is less likely to cause overfitting than linear regression (Roy

2015) as it simultaneously addresses variability and correlation, which appear to be endemic problems in Stack Overflow. Furthermore, PLS is able to handle data noise competently.
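For readers who prefer code to prose, a rough equivalent of this analysis can be run outside Minitab; the sketch below uses scikit-learn's PLSRegression with random placeholder data standing in for the transformed predictors and binary answer-score categories used in this praxis:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.random((599, 6))        # placeholder for the six binary-coded predictors
y = rng.integers(0, 2, 599)     # placeholder: 1 = high-scoring answer, 0 = normal

pls = PLSRegression(n_components=3)
pls.fit(X, y)

fitted = pls.predict(X).ravel()            # continuous fitted responses
predictions = (fitted >= 0.5).astype(int)  # a cutoff turns fits into 0/1 labels
accuracy = (predictions == y).mean()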

Similarly, BLR is regarded as a statistical classification technique that analyzes a set of independent variables to predict a categorical—usually binary—dependent variable. This justifies its use for predicting when an answer is more likely to be tagged as the preferred solution.

Figure 3-4. Research product.


Upon completing the data collection and preliminary cleaning process,82 the methodology to find the best (optimal) answer was divided into four major areas: data preprocessing, text analytics, post metadata, and author metadata.

First, data preprocessing (see Figure 3-5) involved balancing and randomizing the collected post sample for the year 2018 in a way that the data used was evenly distributed across the entire year. For example, assuming there were a total of 1,000 posts, there could not be 500 posts from January and 500 posts from December; a better distribution would be approximately 80 posts for each month of the year. This process is known as systematic sampling. Additionally, fields having a high variability and wide ranges, such as reputation and answer and question scores, were converted into numerical categories.

The categories used were high=1 and normal=0.

82 Process of elimination of blank, incomplete, or expired records.


Figure 3-5. Data preprocessing.
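A minimal sketch of this balancing step in Python (pandas), assuming the posts have already been loaded into a data frame with a datetime Answer_Creation_Date column; the function name is illustrative:

import pandas as pd

def balance_by_month(posts, per_month):
    # Keep only 2018 posts and draw up to per_month posts from each month.
    posts = posts[posts["Answer_Creation_Date"].dt.year == 2018]
    groups = posts.groupby(posts["Answer_Creation_Date"].dt.month)
    return groups.apply(lambda g: g.sample(min(len(g), per_month))).reset_index(drop=True)

# A 1,000-post sample would use roughly 80 posts per month:
# balanced = balance_by_month(posts, per_month=80)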

Second, natural language processing (NLP) and text analytics (see Figure 3-6) were applied over text-based columns, such as questions, answers, and comments. The textual content was also analyzed before modification. The number of words, word frequency (word cloud), and the number of characters were calculated for each post's title and body. Moreover, a capitalization ratio was compiled by counting the total number of uppercase characters and dividing it by the total number of characters in each post. For instance, the word Hello will have a capitalization ratio of 0.2.83 There are limitations to this metric, especially with SQL-related posts containing sample code. SQL commands

83 There are a total of five characters. Only one uppercase character exists: H. There are four lowercase characters: e, l, l, and o. The rate is calculated as follows: 1 / 5 = 0.2.


(statements and functions) are usually written in uppercase, which could offset the

capitalization ratio.

Figure 3-6. Text analytics.
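A minimal Python sketch of the capitalization-ratio metric, consistent with the Hello example above:

def capitalization_ratio(text):
    # Ratio of uppercase characters to total characters in a post.
    if not text:
        return 0.0
    upper = sum(1 for ch in text if ch.isupper())
    return upper / len(text)

# capitalization_ratio("Hello") returns 1 / 5 = 0.2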

The ProWritingAid v2.0 software tool—grammar checker, style editor, and writing mentor—was manually used to calculate grammar and spelling scores.

Readability scores84 were calculated by looking at the numbers of syllables per word and

words per sentence, thus determining the educational grade-level of the writing; usually,

shorter words and sentences yield better results. Automating this process would have

involved a third-party integration with the text and grammar checking API,85 but that step

84 Readability measures should have a score greater than 60. Glue and transition indexes should be under 40% and over 25% respectively. Some of the metrics used were: Flesch-Kincaid Grade, Coleman-Liau, Automated Readability Index, Dale-Chall Grade, Flesch Reading Ease, and Dale-Chall Ease.

85 Source: https://ProWritingAid.com/en/App/API.

would fall outside the scope of the current praxis. (See Figures 3-7 and 3-8 for additional details regarding the ProWritingAid software.)

Figure 3-7. ProWritingAid toolbar in Microsoft Office 365 ProPlus.

Figure 3-8. ProWritingAid summary.
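As an illustration of how such readability scores are derived, the sketch below computes the Flesch Reading Ease metric from word, sentence, and syllable counts. The syllable counter is a crude vowel-group heuristic, and neither function reflects ProWritingAid's internal implementation:

def count_syllables(word):
    # Approximate syllables by counting groups of consecutive vowels.
    vowels = "aeiouy"
    count, previous = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not previous:
            count += 1
        previous = is_vowel
    return max(count, 1)

def flesch_reading_ease(words, sentences):
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))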


The next step was the removal of all articles (e.g., a, the) and “glue” words (e.g.,

some, much, just) from the text. Once the text was reduced, stemming was applied.

Stemming is a technique that reduces a word to its root form, also known as the stem. Stemming differs from lemmatization in that the stem obtained might not be an actual word in human language, whereas a lemma always is. (The programmatic process that

accumulates the bags of words’ scores, using buzzwords and code tags, is described in

chapter 4.)
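A minimal sketch of this reduction step using Python's NLTK, the toolkit also used for the case studies in chapter 4; the standard stop-word list stands in for the articles and "glue" words described above, and the corpora must be downloaded beforehand:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("stopwords") and nltk.download("punkt").
def reduce_text(text):
    # Remove stop words, then stem each remaining token to its root form.
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

# reduce_text("Just comparing the strings") returns ['compar', 'string']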

Third, post metadata (see Figure 3-9) was analyzed using available statistical techniques, specifically, regression via partial least squares and binary logistic regression.

Post metadata refers to timestamps and metrics from posts. Posts can appear as questions, answers, comments, or suggested edits.

Figure 3-9. Post metadata.


Fourth, author metadata (see Figure 3-10) was examined, as were activity and

reputational scores, geographical location, and the number of profile views at the time of the analysis.

Figure 3-10. Author metadata.

The framework also considered whether one or more solutions have high scores, whether a code snippet is present, and the number of hyperlinks86 within

it. These three metrics, along with the analysis of the four major areas presented above,

will classify the solution as optimal or not.

86 Also known as URLs (uniform resource locators), or locations/addresses of internet resources.


Chapter 4 further elaborates on the nuances of NLP, PLS, and BLR, and introduces examples of how the different metrics were calculated, testing the proposed conceptual framework using three case studies.


Chapter 4: Results

4.1 Introduction

As an important step prior to collecting the data, Stack Overflow's Annual Developer Survey results for 2018 were reviewed. The survey was conducted by Stack Overflow in January 2018 and included over 100,000 participants from 183 countries.87 The main takeaways from the survey were the wide adoption of DevOps, machine learning, and artificial intelligence techniques (and their ethical implications), in addition to contrasting individual perspectives and gender-determined88 goals within the Stack

Overflow community. Most users reside in the United States, Western Europe, and India, and the overwhelming majority of developers do coding as a hobby. Furthermore, more than 80% of developers rely on Stack Overflow when learning new programming strategies and techniques, a high level that should be considered with caution, given the survey’s participants. For that reason, survey data was not used as part of the statistical analysis.

87 Source: https://Insights.StackOverflow.com/survey/2018#overview.

88 Less than 10% of users at Stack Overflow are female, but these stats are slowly shifting.


Figure 4-1. Stack Overflow users’ age, educational level, and wake-up time.

As depicted in Figure 4-1, most surveyed users were between 25 and 45 years old, had a bachelor’s degree, and started working between 7:00 am and 9:00 am.89 This information was useful in determining the frequency of contributions and post quality on the forum and in noticing how most developers were below 46 years of age.90

Furthermore, Figure 4-2 shows a list of widely-used methodologies, platforms, databases, frameworks, and programming languages. This data was vital for selecting and

89 These times were adjusted to reflect the local time of the users answering the survey.

90 In most professions, tenure is more prevalent among employees who are between 55 and 64 years of age.


developing three appropriate case studies that could validate the proposed research

product’s predictive power.

Figure 4-2. Most popular technologies in Stack Overflow.

4.2 Data Collection and Preprocessing

After the main trends from the year 2018 were determined, it was easier to proceed with the process of data collection, cleaning, and transformation.

The Stack Exchange Data Explorer contains a query editor91 that retrieves post

metadata from Stack Overflow. The query used appears below and shows that the main

database tables used were Posts, Users, Comments, Suggested Edits, Post Tags, and Tags.

91 Source: https://Data.StackExchange.com/stackoverflow/query/new.


As previously mentioned, the data was limited to the year 2018, and the answers’ scores had to be less than -2 or greater than 5. All blank/NULL or incomplete records were removed from the data prior to analysis. The query, output, and execution plan from

SEDE are part of Figures 4-3, 4-4, and 4-5 respectively.

SELECT DISTINCT
    CASE WHEN p1.[AcceptedAnswerId] > 0 THEN '1' ELSE '0' END AS 'Answer_Accepted'
    , p.[Score] AS 'Answer_Score'
    , p.[CreationDate] AS 'Answer_Creation_Date'
    , p1.[Score] AS 'Question_Score'
    , p1.[CommentCount] AS 'Question_Number_Of_Comments'
    , p1.[ViewCount] AS 'Question_Number_Of_Views'
    , p1.[Tags] AS 'Question_Tags'
    , p.[Body] AS 'Answer_Text'
    , p.[CommentCount] AS 'Answer_Number_Of_Comments'
    , u.[Reputation] AS 'Author_Reputation'
    , u.[Views] AS 'Author_Views'
    , SUM(t.[Count]) AS 'Tag_Sum'
    , ISNULL(SUM(c.[Score]), 0) AS 'Comment_Sum'
    , ISNULL(COUNT(se.[Id]), 0) AS 'Suggested_Edits_Count'
FROM [Posts] p
INNER JOIN [Posts] p1 ON p.[ParentId] = p1.[Id]
INNER JOIN [Users] u ON p.[OwnerUserId] = u.[Id]
LEFT JOIN [Comments] c ON p.[Id] = c.[PostId]
LEFT JOIN [SuggestedEdits] se ON p.[Id] = se.[PostId]
INNER JOIN [PostTags] pt ON pt.[PostId] = p1.[Id]
INNER JOIN [Tags] t ON t.[Id] = pt.[TagId]
WHERE p.[PostTypeId] = 2
    AND YEAR(p.[CreationDate]) = 2018
    AND YEAR(p1.[CreationDate]) = 2018
    AND (p.[Score] > 5 OR p.[Score] < -2)
GROUP BY p.[Id], p.[CreationDate], p1.[Score], p1.[CommentCount]
    , p1.[ViewCount], p1.[Tags], p1.[AcceptedAnswerId], p.[Score]
    , p.[Body], p.[CommentCount], u.[Reputation], u.[Views]
HAVING COUNT(se.[Id]) > 0 AND SUM(c.[Score]) > 0
ORDER BY p.[Score] DESC;

Figure 4-3. Query used for data collection in SEDE.


Figure 4-4. SEDE’s partial results.

Figure 4-5. Query’s partial execution plan.

Table 4-1 was generated in Microsoft Excel for Office 365 v16.0 32-bit. Fields with high variability (i.e., answer’s score and question’s number of views) were

converted into binary variables using a variation of the Calefato approach (Calefato

2018), described in the third chapter (methodology). All operations were completed on a

Dell XPS 8300 desktop computer with Windows 7 Professional, Service Pack 1, 16 gigabytes of RAM, and an Intel(R) Core(TM) i7-2600 3.40 gigahertz processor.

Table 4-1. Sample data for a single post

Columns / Metadata | Original Data | Transformed Data | Microsoft Excel Formula
Answer_Accepted | 1 | 1 | -
Answer_Score | 3,264 | 1 | =IF(CELL > AVERAGE(Answer_Score) * 2, 1, 0)
Answer_Creation_Date | 1/15/2018 8:35:15 pm (MM/DD/YYYY) | 3 | =WEEKNUM(CELL)
Question_Score | 2,421 | 1 | =IF(CELL > AVERAGE(Question_Score) * 2, 1, 0)
Question_Number_Of_Comments | 17 | 1 | =IF(CELL > AVERAGE(Question_Number_Of_Comments) * 2, 1, 0)
Question_Number_Of_Views | 360,255 | 1 | =IF(CELL > AVERAGE(Question_Number_Of_Views) * 2, 1, 0)
Question_Tags (renamed Bags_Of_Words_Score) | ...-6> | 24 | Score calculated in the C-sharp utility program (see Figure 4-6)
Answer_Text (renamed Answer_Text_Length) | "If you take advantage of..." | 1 | =IF(LEN(CELL) < 2000, 1, 0)
Answer_Number_Of_Comments | 19 | 1 | =IF(CELL > AVERAGE(Answer_Number_Of_Comments) * 2, 1, 0)
Author_Reputation | 87,398 | 1 | =IF(CELL > AVERAGE(Author_Reputation) * 2, 1, 0)
Author_Views | 21,365 | 1 | =IF(CELL > AVERAGE(Author_Views) * 2, 1, 0)
Tag_Sum | 107,472,957 | 1 | =IF(CELL > AVERAGE(Tag_Sum) * 2, 1, 0)
Comment_Sum | 870 | 1 | =IF(CELL > AVERAGE(Comment_Sum) * 2, 1, 0)
Suggested_Edits_Count | 114 | 1 | =IF(CELL > AVERAGE(Suggested_Edits_Count) * 2, 1, 0)

Note. The CELL keyword varies depending on the value to be transformed. Averages are calculated for the entire referenced column.
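The doubled-average rule shown in the Excel formulas above can be expressed compactly in Python (pandas); this is a hedged sketch, assuming the SEDE export has been loaded into a data frame named posts:

import pandas as pd

def binarize(column):
    # 1 when a value exceeds twice the column average, 0 otherwise.
    return (column > 2 * column.mean()).astype(int)

# Examples mirroring Table 4-1:
# posts["Author_Reputation"] = binarize(posts["Author_Reputation"])
# posts["Answer_Text_Length"] = (posts["Answer_Text"].str.len() < 2000).astype(int)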

Due to the high number of line breaks, commas, single-quotes, double-quotes, and

special characters within the answer’s body content column, an exhaustive data cleaning

process was conducted. One of the Excel formulas used is shown here:

=CLEAN(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(H2, CHAR(13), ""),

CHAR(10), ";"), ",", ";"))). Not only did these operations make it easier to analyze the

data, but they also linearized all textual content, making it compatible with the comma-separated values (CSV) file format. The encoding applied was UTF-8.92 Moreover, code

syntax, such as C++, which interfered with the compilation process, was substituted with

an acceptable value (e.g., C-plus-plus).

92 This character encoding method is a variation of the Unicode Transformation Format.
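For reference, an approximate Python equivalent of that Excel cleaning formula is sketched below; the function name is illustrative:

def linearize(text):
    # Mirror SUBSTITUTE: drop carriage returns, map line feeds and commas to semicolons.
    text = text.replace("\r", "").replace("\n", ";").replace(",", ";")
    # Mirror CLEAN and TRIM: remove non-printable characters, collapse whitespace.
    text = "".join(ch for ch in text if ch.isprintable())
    return " ".join(text.split())

# linearize("line one,\r\nline two") returns "line one;;line two"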


The calculation of the BOW scores was implemented in Visual Studio
Professional 2019 v16.3 with the .NET Framework v4.8. The basic functionality,

summarized in Figure 4-6, required the creation of two CSV files, one containing the

original raw data and a second file for storing the values compiled in the utility C-sharp program.

A total of 1,485 lines were analyzed, excluding the header row, which was omitted from the analysis. Each line was expected to have 14 elements—one for each column displayed in Table 4-1. Only the elements below, however, were used for compiling BOW scores:

 Question_Tags (position 6,93 element 7)

 Answer_Text (position 7, element 8)

 Tag_Sum (position 11, element 12)

Using regular expressions and ignoring case sensitivity, for every matching tag

within the body of an answer, five points were added. If a code tag was present, ten

points were added. (Other data validation techniques and error handling implemented in

the program are not included as part of Figure 4-6.)

93 In C-sharp, element positions start at zero. For instance, the last position of an array with 10 elements would be 9.


// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
int lineNumber = 0;
string[] allLines = File.ReadAllLines(_rawDataFile, Encoding.UTF8);
foreach (string line in allLines)
{
    lineNumber++;
    // Skip the header row.
    if (lineNumber == 1) { continue; }

    string[] fields = line.Split(',');
    // Every record must contain the 14 columns listed in Table 4-1.
    if (fields.Length != 14)
    {
        File.AppendAllText(_textScoresFile, "0" + "\n");
        continue;
    }

    string questionTagList = fields[6];
    string answerBody = fields[7];
    string tagSum = fields[11];
    string[] tagList = questionTagList.Split('>');
    int count = 0;
    foreach (string tag in tagList)
    {
        string cleanTag = tag;
        cleanTag = cleanTag.Replace(">", string.Empty);
        cleanTag = cleanTag.Replace("<", string.Empty);
        if (!string.IsNullOrEmpty(cleanTag))
        {
            // Five points for every case-insensitive tag match within the answer's body.
            foreach (Match match in Regex.Matches(answerBody, cleanTag, RegexOptions.IgnoreCase))
            {
                count = count + 5;
            }
        }
    }

    // Ten points when the answer contains a <code> tag.
    if (answerBody.ToLower().Contains("<code>"))
    {
        count = count + 10;
    }
    File.AppendAllText(_textScoresFile, count.ToString() + "\n");
}

Figure 4-6. C-sharp functionality for calculating BOW scores.

The link between the data retrieved from the SEDE website and the quantitative methods applied is shown below in Figure 4-7 and further developed in the following section. Only SEDE data (and not survey information) was used to perform the statistical analysis.


Figure 4-7. Linking data source and quantitative methods.

4.3 Predictive Models

In the current section, the following two research hypotheses will be evaluated:

RH1: If an answer targets a question with a high number of comments, has fewer recommended edits than other posts within the same thread, and is succinct (has fewer than

2,000 characters), then the answer will have a high relevance score.

RH2: If an answer contains a high cumulative tag score and a significant word frequency (detected via word segmentation, bags of words, and stemming), then the


answer will more likely be selected as the asker’s preferred answer, thus becoming easier

to find.

The dataset was thoroughly evaluated to determine if it was evenly distributed

throughout the year 2018, a process known as systematic sampling. The total sample size

consisted of 1,485 records (n = 1,485; see Figure 4-8).

The box and whisker and histogram plots shown below indicate the data was adequate for further analysis. Post frequency is consistent from month to month, except for a steady decline toward the end of the year. This can be explained by two main factors: Many employees use accrued vacation time that was set to expire toward the end of the year, and most of the year's holidays occur in the months of November and December

(i.e., Thanksgiving Eve, Thanksgiving Day, Christmas Eve, Christmas Day, and New

Year’s Eve).


Figure 4-8. Post frequency distribution per week.

The research questions posed in chapter one also served as the basis for the statistical analysis. As expected, the average score of accepted answers was 55.53%

higher than the average score of rejected ones.94 The standard deviation calculated for overall author reputation was 87,645,95 which shows how much variability is prevalent within Stack Overflow in terms of technical expertise and user engagement.

Similarly, almost three-quarters of answers96 contained at least one tag or buzzword from the question. The average number of comments per answer was 70.07% higher than the average number of comments per question.97 The range of suggested edits per chosen answer was 1 to 340,98 which hints at varying levels of post quality.

Quantitative analysis (multivariate regression) and, specifically, partial least squares with a confidence level of 95% (α = 0.05) was conducted in Minitab v18.1.

The total number of principal components—not to be confused with significant variables or factors—specified was five.

94 For accepted answers, the formula used was =AVERAGEIF(Answer_Accepted, 1, Answer_Score), which returned an average score of 39.38. For rejected answers, the formula used was =AVERAGEIF(Answer_Accepted, 0, Answer_Score), which returned an average score of 25.32.

95 Excel formula: =STDEVA(Author_Reputation).

96 Excel formula: = 1,033 / COUNTA(Question_Tags). The numerator value of 1,033 was compiled in C- sharp by using an accumulator method that added one every time an answer’s body had at least one question’s tag: each post could only be added once. The resulting pairing or matching was 69.56%.

97 For answers, the formula used was =AVERAGE(Answer_Number_Of_Comments), which returned an average of 4.83. For questions, the formula used was =AVERAGE(Question_Number_Of_Comments), which returned an average score of 2.84.

98 The range's floor or minimum value was calculated in Excel as follows: {=MIN(IF(Answer_Accepted = 1, Suggested_Edits_Count))}; whereas the ceiling or maximum value was calculated in the following fashion: {=MAX(IF(Answer_Accepted = 1, Suggested_Edits_Count))}.


A few researchers have used PLS for dealing with binary, dependent response variables. Some of the scenarios considered were:

 Rodríguez-Pérez (2018) demonstrated that PLS does not perform well while

analyzing factors having a high dimensionality or when applied to datasets

containing few records.

 Yin (2018) and Cao (2018) used discrete, modified binary particle swarm

optimization (BPSO) algorithms via PLS and adaptive models, such as remote

sensing.99

 Within a hurdle model,100 Zhang (2018) explored how a binomial framework

guided the dependent variables’ binary outcomes, which can be zero or positive.

 Sun (2019) predicted simulated phenotypical traits of different populations using

binary indicators in the response variable, which outperformed FST101 and

EigenGWAS102 methods.

 Conversely, Murase (2018) applied PLS to material discrimination in the near-infrared (NIR) band, with outcomes stored as binary classifiers.

The first research hypothesis tries to predict when an answer might have a high relevance score. Nonetheless, the data collected was unbalanced because only 20.54% of answers (305 out of 1,485) had a high score. Random sampling was applied to the dataset

99 Remote sensing is a reliable tool for measuring [inland] water quality (Cao 2018).

100 A hurdle model is based on binary or before-and-after thresholds.

101 A fixation index is a score of population differentiation that evaluates genetic structures (Sun 2019).

102 Combination of genome-wide association studies and eigenvector decomposition (Sun 2019).


to ensure that the percentage of high-scoring answers approached 50%.103 After

randomization, the new sample had 305 high-scoring and 294 regular answers (n = 599,

50.92%).

The best predictive accuracy obtained using all independent variables approached

70%. To prevent overfitting, it was then decided to choose significant factors by looking

into standard coefficient and loading plots (see Figures 4-9 and 4-10). Upon careful

examination, the following six variables104 were selected, and the model was re-run with three components, yielding a prediction accuracy of 73.96%:

 Question_Score

 Question_Number_Of_Views

 Author_Views

 Comment_Sum

 Suggested_Edits_Count

 Answer_Text_Length

103 The Excel formula used for randomizing and balancing the data was: =IF(CELL = 1, RANDBETWEEN(1, 80), RANDBETWEEN(81, 100)).

104 The first three independent variables were pulled from the coefficient and loading plots, whereas the remaining three were included from the first research hypothesis.



Figure 4-9. PLS standard coefficient plot for answer score.


Figure 4-10. PLS loading plot for answer score.


The response plot is presented below in Figure 4-11. The resulting statistics are also included. Large residuals in the Y-axis are common in poor models, whereas X-residuals can identify outliers (Roy 2015).


Figure 4-11. PLS response plot for answer score.

Table 4-2. Analysis of Variance

Source DF SS MS F P

Regression 3 35.886 11.9620 65.24 0.000

Residual Error 595 113.814 0.1913

Total 598 149.699


Table 4-3. Model Selection and Validation

Components X Variance Error R-squared

1 0.366245 114.172 0.237327

2 0.544771 113.818 0.239690

3 0.662258 113.814 0.239720

Table 4-4. Coefficients

Term Answer_Score Standardized Answer_Score

Constant 0.323585 0.000000

Question_Score 0.255689 0.205350

Question_Number_Of_Views 0.226486 0.181333

Author_Views -0.078108 -0.042821

Comment_Sum 0.207882 0.157511

Suggested_Edits_Count 0.134519 0.101144

Answer_Text_Length 0.043785 0.033549

Minitab’s fit coefficients were used to determine the predicted value via a variable

cutoff threshold. Predicted values were then compared to actual values, and an

accuracy score was compiled. For example, out of 599 records, 443 of them were correctly predicted, for a 73.96% accuracy score. For easier testing, minimum and maximum fit coefficients were retrieved, and a histogram was built using fit coefficients.


This contributed to a better understanding of the dynamics and behaviors guiding the prediction model.
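The cutoff search just described can be reproduced with a short script; this sketch assumes fitted holds Minitab's fit coefficients and actual the observed binary categories, both as NumPy arrays:

import numpy as np

def best_cutoff(fitted, actual):
    # Sweep cutoffs between the minimum and maximum fits, keeping the most accurate.
    best_t, best_acc = None, 0.0
    for t in np.linspace(fitted.min(), fitted.max(), 101):
        acc = ((fitted >= t).astype(int) == actual).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# With the praxis data, 443 correct predictions out of 599 records yields 443 / 599 = 73.96%.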

The results seemed to confirm the first hypothesis, given that the answer's comment and character counts, along with suggested edits and related factors such as question score and author views, were found to be significant.

Unsurprisingly, questions’ scores and the number of views have a positive influence on good answers. Good, relevant questions have higher demand and receive more traffic than do poor, irrelevant questions; hence, there is a greater number of users who can upvote the solutions within the thread.

An interesting finding is that an author's number of profile views correlates negatively with answer scores. A plausible explanation could be that, when trying to offer feedback for an erroneous answer, knowledgeable users might visit the author's page to give their criticism in a direct, yet private fashion.

While the first research hypothesis predicts high-scoring answers (the community's best answer), the second hypothesis seeks to predict whether an answer is chosen by the individual user who posed the question. In this case, due to a lack of cross-validation controls and to the existence of a myriad of subjective factors guiding individual user behavior, one may anticipate these results to be less reliable than those obtained for the first hypothesis; this caveat is especially relevant if the prediction accuracy turns out to be greater.


Once again, the data for chosen or accepted answers was skewed, and 77.85% of

the answers were marked as accepted (1,156 posts out of 1,485). Random sampling was

conducted, and the resulting dataset contained only 332 chosen answers (n = 639,

51.96%).105

The statistical method used for analysis was BLR (i.e., the answer is selected or not), with Logit as the link function, a two-sided confidence level of 95%, and

Pearson residuals. All generated plots were standardized.

The five significant factors are displayed below:

 Answer_Score

 Comment_Sum

 Suggested_Edits_Count

 Bags_Of_Words_Score

 Tag_Sum

105 The Excel formula used was: =IF(CELL = 1, RANDBETWEEN(1, 20), RANDBETWEEN(21, 100)). CELL refers to individual values within the Answer_Accepted column.


Table 4-5. Deviance Table

Source DF Adj. Dev. Adj. Mean Chi-Square P-Value

Regression 5 280.901 56.180 280.90 0.000

Answer_Score 1 205.182 205.182 205.18 0.000

Comment_Sum 1 10.021 10.021 10.02 0.002

Suggested_Edits_Count 1 54.669 54.669 54.67 0.000

Bags_Of_Words_Score 1 2.345 2.345 2.35 0.126

Tag_Sum 1 2.537 2.537 2.54 0.111

Error 633 603.963 0.954

Total 638 884.864

Table 4-6. Model Summary

Deviance R-squared Deviance R-squared (adjusted) AIC

31.75% 31.18% 615.96

Table 4-7. Coefficients

Term Coefficient SE Coefficient VIF

Constant -1.14200 0.14100

Answer_Score 2.99300 0.24600 1.28

Comment_Sum -1.15700 0.36300 1.35

Suggested_Edits_Count 2.41900 0.35800 1.62

Bags_Of_Words_Score -0.00200 0.00127 1.04

Tag_Sum -0.52300 0.33100 1.58


Table 4-8. Goodness-of-fit Tests

Test DF Chi-Square P-Value

Deviance 633 603.96 0.791

Pearson 633 659.21 0.228

Hosmer-Lemeshow 8 14.97 0.060

The prediction accuracy of chosen answers was 80.91% (517 correct predictions

out of 639 total records). Even though it was possible to predict chosen answers

successfully, the bags of words and cumulative tag scores were not as significant as the

answer score and the suggested number of edits. More research might be needed to

explore the significance of bounties and badges in answer selection by users who pose

questions.

The regression equation describing the model used to predict answer selection is

the following:

P(1) = exp(Y') / (1 + exp(Y'))

Y' = -1.142 + 2.993 Answer_Score - 1.157 Comment_Sum + 2.419 Suggested_Edits_Count

- 0.00200 Bags_Of_Words_Score - 0.523 Tag_Sum
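Applied directly, the fitted equation translates into the following Python sketch, which returns the probability that an answer is selected; the parameter names mirror the transformed columns:

import math

def probability_selected(answer_score, comment_sum, suggested_edits_count,
                         bags_of_words_score, tag_sum):
    # Evaluate the fitted BLR model: P(1) = exp(Y') / (1 + exp(Y')).
    y = (-1.142 + 2.993 * answer_score - 1.157 * comment_sum
         + 2.419 * suggested_edits_count - 0.00200 * bags_of_words_score
         - 0.523 * tag_sum)
    return math.exp(y) / (1 + math.exp(y))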

With the confirmation of the first two hypotheses, the next step consists of

validating the research product. As described in the following section, the initial two hypotheses serve as the foundation for implementing the case studies.


4.4 Case Studies

In this section, the research product is tested by applying it to three case studies, spanning .NET-, API-, and JavaScript-related questions. The third hypothesis is the following:

RH3: If the model outlined in RH1 and RH2 is used in three case studies, more efficient/relevant data acquisition will occur, and Stack Overflow’s high-scoring topics will be easier to find (search optimization).

Figure 4-12 shows the link between the two hypotheses covered in the previous section and the third hypothesis developed in the current one.

Figure 4-12. Relationship model between research hypotheses and case studies.


The first case study consisted of finding a Stack Overflow question related to

.NET via the Stack Exchange Data Explorer or Google search engine. Hence, the following inquiry was found: Do we use the double equal sign (==) or .Equals() for comparing strings in C#?106 All four areas described in the methodology chapter were applied.

 Data preprocessing:

o Converted answer scores into categories (high: > 50, normal: >10, and low:

<= 10): Out of 16 answers, 11 were low, 3 were normal, and 2 were high.

o Converted question scores into categories (high: > 50, normal: >10, and low:

<= 10): Since a single question was analyzed and its score was 491, it was

tagged as high.

 Text analytics:

o Compiled the cumulative BOW score (unique tags plus code tags). There were

no code tags found within the optimal answer (highest score and asker’s

choice), but the BOW score was 10 because the equals tag appeared twice.

Standard tags (e.g., c#, .net, equals) add five points every time they appear,

whereas code tags add 10 points per occurrence.

o Calculated the number of characters per post’s body and title: Counting

spaces, the question’s title had 37 characters, and its body had 393 characters.

o Generated the number of words per post and calculated word frequency:

There were 6 words in the title and 55 words in the body.

106 Source: https://StackOverflow.com/questions/814878/c-sharp-difference-between-and-equals.


o Capitalization (uppercase) ratio: Out of 235 characters in the answer, only 6

were uppercase characters: 6 / 235 = 2.55%.

o Removed articles, stop and “glue” words, and applied stemming:107 After

using the NLTK (Natural Language Toolkit) in Python, the following words

remained: ==, expression, type, System.Object.ReferenceEquals.Equals,

virtual, method, override, version, string, comparison, and content (11 words

remaining out of 38 original words).

o Grammar/spelling score: Using ProWritingAid, the grammar score was

determined to be 22/100 while the spelling score was 35/100.

 Post metadata:

o Day of the week: Saturday, 05/02/2009.

o Time of the day (morning, afternoon, evening, etcetera): Afternoon, 13:39 or

1:39 pm.

o Number of answers/comments/edits and question/answer/comment score:

There were a total of 16 answers, 45 comments, and 11 recommended edits.

The question had a total score of 491, whereas the best answer had a score of

395, and the best comment had a score of 49.

 Author metadata:

o Activity score: 8,590 actions.

o Reputation: 351,765.

o Geographical location: California, United States.

o Number of profile views: 51,630.

107 Stemming consists of reducing a given word to its root form.


 Other:

o High score: Yes.

o Code snippet: No.

o Number of URLs: 2.

Out of the 18 conditions listed, a total of 16 were met, for an 88.89% combined score. The algorithm determined that the answer selected by the asker and the community was indeed the best/optimal one.

The remaining two case studies are presented in tabular form (see Table 4-9).


Table 4-9. Case studies in API and JavaScript

Conditions / Case Studies Case Study 2 – API108 Case Study 3 - JavaScript109

Totals 12 / 18 = 66.67% 14 / 18 = 77.78%

Answer categories Low: 3 High: 2

Question categories Low: 1 High: 1

BOW score 10 20

Character count 206 671

Word count 40 116

Capitalization ratio 3 / 206 = 1.46% 19 / 671 = 2.83%

Words after using NLTK 16 23

Grammar and spelling scores 10; 10 21; 11

Day of the week Saturday Tuesday

Time of the day Afternoon Morning

Comment and edit metadata 1; 0 10; 0

Author activity score 234 1,204

Reputation 343 18,775

Geographical location Singapore, Singapore Klagenfurt, Austria

Number of profile views 252 1,125

High score No Yes

Code snippet No Yes

Number of URLs 2 2

Note. Underlined cell values mean that other answers within the thread met optimal criteria. Both cumulative percentages are consistent with the expected results and post quality. For instance, even though both answers were chosen by the asker, the lower scores in the API solution are correctly reflected on the lesser total percentage.

108 Source: https://StackOverflow.com/questions/5595334/paypal-integration.

109 Source: https://StackOverflow.com/questions/12797118/how-can-i-declare-optional-function- parameters-in-javascript.


As shown by the implementation and analysis of the three case studies, the proposed research product could increase the speed of knowledge transfer within Stack Overflow, which could in turn optimize the platform's searching mechanisms.


Chapter 5: Discussion and Conclusions

5.1 Discussion

Stack Overflow meshes (or “plug-and-plays”) two important techniques. First, it

uses a Q&A format without following a traditional discussion forum paradigm; second,

its signature layout resembles that of Wiki-based platforms in that moderators are chosen

by the user community and interactions depend on collaborative trust. Pivotal to that

communal trust are reputational scores, a metric that becomes relevant because all

developers within the site are regarded as peers. Stack Overflow adopts the role of a lingua franca, a bridge connecting developers.

Stack Overflow, however, reveals a major paradox, contrasting the interests and goals of individuals with those of the community. Such interests and goals are not always aligned and, in some instances, are mutually exclusive. For instance, since only the asker can mark an answer as accepted, questions often go longer without being marked as resolved.

5.2 Conclusions

All three research hypotheses were confirmed. Firstly, PLS analysis was able to predict high-scoring answers with a 73.96% accuracy. Secondly, chosen answers could be correctly predicted 80.91% of the time via BLR. Thirdly, the three case studies completed in the previous chapter showed encouraging results and opened the possibility for further research and possible automation of the research product. These outcomes


could be useful for optimizing search inquiries110 for the millions of developers who

access Stack Overflow every day.

The leading factors that prevented the improvement of prediction accuracy percentages were the massive amount of duplicate information within the site, the variability of answerers’ expertise on the subject, and the responders’ competence in the use of the English language. Moreover, there are many questions that are not duplicates; however, they are similar enough to cause confusion among users and negatively affect post quality. Such drawbacks limit Stack Overflow’s ability to attract a greater number of high-quality posts.

As shown by Stack Overflow, proper knowledge transfer in the area of software development defies and possibly defeats the law of economic scarcity. A technically savvy, cross-trained team is more productive than a team pervaded by silos of expertise.

Listening to the Stack Overflow community could inform business processes, company planning, and even stock valuations that are heavily influenced by emerging programming techniques or breakthroughs. Without a doubt, adequate knowledge transfer must be a vital component of every IT department. When properly applied, knowledge transfer and intellectual capital can become an industry differentiator, a driver for innovation and a means of achieving competitive advantage (Bock and Kim 2002).

Stack Overflow can guide this innovation from the bottom-up (i.e., from developers to managers to executives).

110 This can be accomplished by translating scoring mechanisms into sitemaps and SEO (search engine optimization) within Stack Overflow.


5.3 Contributions to Body of Knowledge

This praxis contributes to the body of knowledge in two important ways:

1. By introducing the definition of optimal answer region, the nature and outcome

of the statistical analysis are more holistic than previous research conducted on

this area.

2. By building upon existing knowledge management techniques and knowledge

transfer methodologies, the proposed research product facilitates finding optimal

answers within Stack Overflow, with approximately 80% prediction accuracy.

5.4 Recommendations for Future Research

The amount and quality of the information found in Stack Overflow could inspire future research in several areas. A few recommendations are outlined below:

 Implementation and automation of conceptual frameworks related to knowledge

transfer, knowledge diffusion, and talent acquisition

 Analysis of online forum’s dynamics from a sociological perspective

Of primary importance, the conceptual framework developed here could be built into a configurable software tool for increased predictive power for locating high-scoring answers within Stack Overflow. Hence, researchers could expand the methodologies used in this praxis by scaling the dataset and using more up-to-date information. Also, they could implement a software tool to automate the processes of data collection, cleaning, and analysis.


Second, a case study could be conducted to evaluate the effectiveness of

implementing Stack Overflow for Teams111 in a development setting, especially if the knowledge managed is sensitive, privileged, or confidential. Stack Overflow for Teams is a private, cloud-based, searchable, knowledge management platform with a centralized source-of-truth that seamlessly integrates all capabilities from the Stack Overflow environment into companies. The information is handled reliably and securely for little over $200 per user per annum, driving internal team collaboration and lowering overhead costs. Furthermore, chatting tools like Slack can be easily integrated as part of Stack

Overflow for Teams to track article updates and notify users when new questions and answers become available. According to claims by Stack Overflow, employees using the platform can save up to 20 hours per month; internal support requests are resolved 20% faster, and, most importantly, solutions are stored for a future consult in a knowledge base format. In fact, companies like Expensify112 have successfully implemented Stack

Overflow for Teams.

Third, research could be conducted in Stack Overflow at the user level to

determine areas of expertise and areas requiring additional training for a given developer.

For instance, one could design an m × n knowledge matrix for finding subject area

experts. Rows would be represented by the letter m, and they would contain the names or

aliases of all engineers. Columns would be represented by the letter n, and they would

111 Source: https://StackOverflow.com/teams.

112 Expensify is a software company offering expense management systems for individuals and companies.


A Likert scale ranging from zero to five would be used (zero being no knowledge in a given area, and five being the highest attainable expertise). The knowledge matrix could also include contact details and availability, and it would work similarly to yellow pages (Delen and Al-Hawamdeh 2009). A good methodology for analyzing these data could be cluster analysis, which groups data objects into clusters through unsupervised classification, with the objective of finding high intra-class similarity and low inter-class similarity while detecting hidden patterns. A minimal sketch follows.
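The sketch below clusters a hypothetical four-engineer knowledge matrix with scikit-learn's KMeans. The engineer names, skill columns, scores, and the choice of k-means itself are assumptions for illustration; the praxis does not prescribe a particular clustering library or algorithm.

# Minimal sketch: clustering a hypothetical m x n knowledge matrix
# (engineers x technologies, Likert scores 0-5) to find groups of
# engineers with similar expertise profiles. All names and scores
# below are illustrative, not data from this praxis.
import numpy as np
from sklearn.cluster import KMeans

engineers = ["alice", "bob", "carol", "dave"]   # the m rows
skills = ["C#", "SQL", "Python", "JavaScript"]  # the n columns
matrix = np.array([
    [5, 4, 1, 0],
    [5, 3, 0, 1],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(matrix)
for name, label in zip(engineers, kmeans.labels_):
    print(f"{name}: cluster {label}")

Engineers landing in the same cluster would share similar expertise profiles, which is precisely the yellow-pages lookup described above.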

Fourth, an analysis of Stack Overflow's data could be invaluable to Human Resources (HR) departments and hiring/recruiting agencies. Hiring good programmers is an extremely challenging and tedious task, made so by a candidate-driven market in which a high number of open positions (demand) is met by a low number of available, reliable developers (supply). By researching technological and industry trends found within the site, technical recruiters could determine which tools are in greater demand (e.g., most questions from June 2019 are related to the Rust programming language) and find knowledgeable users within a geographical area for immediate screening.114 HR teams could also use these data to guide their staffing initiatives and fine-tune their onboarding programs to decrease turnover rates, as sketched below. Indeed, with over 50 million developers visiting Stack Overflow every month, the platform represents an excellent recruiting115 alternative via the Careers 2.0 initiative.
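A hedged sketch of the trend analysis described above follows. The file name, column names, and the "<tag1><tag2>" tag format are assumptions based on the SEDE schema documented in Appendix A, Table A-3.

# Minimal sketch: monthly question counts per tag from a hypothetical
# SEDE export "questions.csv" with CreationDate and Tags columns.
import pandas as pd

posts = pd.read_csv("questions.csv", parse_dates=["CreationDate"])

# Split the "<tag1><tag2>" format into one row per tag.
tags = (posts.assign(Tag=posts["Tags"].str.findall(r"<([^>]+)>"))
             .explode("Tag"))

monthly = (tags.groupby([tags["CreationDate"].dt.to_period("M"), "Tag"])
               .size()
               .rename("Questions"))

# Top five tags for a given month, e.g., June 2019:
june_2019 = monthly.loc[pd.Period("2019-06", freq="M")]
print(june_2019.nlargest(5))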

113 It would be useful to add a dictionary with detailed descriptions for each topic or tool.

114 This process can also be regarded as recruitment advertising and/or branding.



Fifth, sociological studies could be conducted to evaluate the factors leading to site popularity, user reputation, and community interaction within online forums. Stack Overflow is part of a bigger platform called Stack Exchange, which expands its customer base by using the Area 51 approach, a method in which approximately two hundred core users in a given discipline are recruited and surveyed with the goal of branching out into a new site targeting specialized users (e.g., chefs or video editors). On average, only 10% of all feedback is actionable. Studying social dynamics within individual sites, and how those sites relate to each other, could help in understanding the information flows that guide how online communities grow and evolve. Similarly, researchers could determine which personality traits are prevalent in Stack Overflow, thereby assessing the level of diversity within the community and the types of role models users seek (i.e., the factors driving peer recognition).

115 Sources: https://www.StackOverflowBusiness.com/talent/platform, https://StackOverflow.blog/2011/02/23/careers-2-0-launches/, and https://StackOverflow.com/jobs/get-started. Careers 2.0 recommends portfolio-based over résumé-based screening.


References

Abdalkareem, Rabe, Emad Shihab, and Juergen Rilling. 2017. "What Do Developers Use the Crowd For? A Study Using Stack Overflow." IEEE Software 34 (2): 53-60.

Akshatha, S., and S. Senthil Ganesh. 2017. "Exploring the Determinants of Exit Experience: Results from the Survey of Ex-Employees in India." International Conference on Communication and Signal Processing, Coimbatore.

Alawneh, Ali Ahmad, and Rashad Aouf. 2016. "A Proposed Knowledge Management Framework for Boosting the Success of Information Systems Projects." IEEE Engineering and MIS (ICEMIS) 1-5.

Amiryany, Nima, and Jeanne W. Ross. 2014. "Acquisitions that Make your Company Smarter." MIT Sloan Management Review 55 (2): 13. https://SloanReview.Mit.edu/article/acquisitions-that-make-your-company-smarter/.

Attiaoui, Dorra, Arnaud Martin, and Boutheina Ben Yaghlane. 2017. "Belief Measure of Expertise for Experts Detection in Question Answering Communities: Case Study Stack Overflow." Procedia Computer Science 112: 622-631.

Babcock, Pamela. 2004. "Shedding Light on Knowledge Management." HR Magazine 49 (5): 46-51. https://www.Shrm.org/hr-today/news/hr-magazine/pages/0504covstory.aspx.

Benabou, Charles, and Raphaël Benabou. 1999. "Establishing a Formal Mentoring Program for Organization Success." National Productivity Review: The Journal of Organizational Excellence 19 (4): 7-14.

Bock, Gee W., and Young-Gul Kim. 2002. "Breaking the Myths of Rewards: An Exploratory Study of Attitudes about Knowledge Sharing." Information Resource Management Journal 15 (2): 14-21.

Bureau of Labor Statistics, United States Department of Labor. 2012. "Employee Tenure." https://www.Bls.gov/news.release/archives/tenure_09182012.pdf.

Calefato, Fabio, Filippo Lanubile, and Nicole Novielli. 2018. "How to Ask for Technical Help? Evidence-Based Guidelines for Writing Questions on Stack Overflow." Information and Software Technology 94: 186-207. http://Dx.Doi.org/10.1016/j.infsof.2017.10.009.

Cancialosi, Chris. 2017. "Six Key Steps to Influencing Effective Knowledge Transfer in your Business." Accessed November 22, 2018. https://www.Forbes.com/sites/chriscancialosi/2014/12/08/6-key-steps-to-influencing-effective-knowledge-transfer-in-your-business/#77b27e945fe6.

Cao, Yin, Yuntao Ye, Hongli Zhao, Yunzhong Jiang, Hao Wang, Yizi Shang, and Junfeng Wang. 2018. "Remote Sensing of Water Quality Based on HJ-1A HSI Imagery with Modified Discrete Binary Particle Swarm Optimization-Partial Least Squares (MDBPSO-PLS) in Inland Waters: A Case in Weishan Lake." Ecological Informatics 44: 21-32.

Chang, Young Bong, and Vijay Gurbaxani. 2012. "Information Technology Outsourcing, Knowledge Transfer, and Firm Productivity: An Empirical Analysis." MIS Quarterly: Management Information Systems 36 (4): 1043-1063.

Changchit, Chuleeporn. 2003. "An Investigation Into the Feasibility of Using an Internet-Based Intelligent System to Facilitate Knowledge Transfer." Journal of Computer Information Systems 43 (4): 91-99.

Chesbrough, Henry W. 2003. "A Better Way to Innovate." Harvard Business Review 81 (7): 12-13.

Chua, Alton Y. K., and Snehasish Banerjee. 2015. "Answers or No Answers: Studying Question Answerability in Stack Overflow." Journal of Information Science 41 (5): 720-731.

Davenport, Thomas H. 1997. "Ten Principles of Knowledge Management and Four Case Studies." Knowledge and Process Management 4 (3): 187-208.

Delen, Dursun, and Suliman Al-Hawamdeh. 2009. "A Holistic Framework for Knowledge Discovery and Management." Communications of the ACM 52 (6): 141-145.

DeMeyer, Arnoud. 1991. "Tech Talk: How Managers are Stimulating Global R and D Communication." Massachusetts Institute of Technology, Sloan Management Review 32 (3): 49-58.

DeSimone, L. D., George N. Hatsopoulos, William F. O'Brien, Bill Harris, and Charles P. Holt. 1995. "How Can Big Companies Keep the Entrepreneurial Spirit Alive?" Harvard Business Review 73 (6): 183-189.

Dillon, Peter C., William K. Graham, and Andrea L. Aidells. 1972. "Brainstorming on a Hot Problem: Effects of Training and Practice on Individual and Group Performance." Journal of Applied Psychology 56 (6): 487-490.

Garbade, Michael J. 2018. "A Simple Introduction to Natural Language Processing." Accessed September 6, 2019. https://BecomingHuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32.

Geiger, Adrianne H. 1994. "Measures for Mentors." The Training and Development Sourcebook 46 (2): 65-67.

Gist, Marilyn E., Catherine Schwoerer, and Benson Rosen. 1989. "Effects of Alternative Training Methods on Self-Efficacy and Performance in Computer Software Training." Journal of Applied Psychology 74 (6): 884-891.

Guo, Yuchen, Guiguang Ding, Yuqi Wang, and Xiaoming Jin. 2016. "Active Learning with Cross-Class Knowledge Transfer." AAAI, Beijing, 1624-1630.

He, Ming, Jiuling Zhang, and Jiang Zhang. 2017. "MINDTL: Multiple Incomplete Domains Transfer Learning for Information Recommendation." China Communications 14 (11): 218-236. doi:10.1109/CC.2017.8233662.

He, Ming, Jiuling Zhang, Peng Yang, and Kaisheng Yao. 2018. "Robust Transfer Learning for Cross-Domain Collaborative Filtering Using Multiple Rating Patterns Approximation." In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 225-233. ACM.

Ihrig, Martin, and Ian MacMillan. 2015. "Managing your Mission-Critical Knowledge." Harvard Business Review 93 (1).

Iyengar, Kishen, Jeffrey R. Sweeney, and Ramiro Montealegre. 2015. "Information Technology Use as a Learning Mechanism: The Impact of IT Use on Knowledge Transfer Effectiveness, Absorptive Capacity, and Franchisee Performance." MIS Quarterly: Management Information Systems 39 (3): 615-641.

Jacobs, Susan. 2018. "Developing a Strategy to Facilitate Knowledge Transfer." Accessed November 22, 2018. https://www.LearningSolutionsMag.com/articles/developing-a-strategy-to-facilitate-knowledge-transfer.

Jiang, Wenhao, Wei Liu, and Fu-Lai Chung. 2018. "Knowledge Transfer for Spectral Clustering." Pattern Recognition 81: 484-496. https://Doi.org/10.1016/j.patcog.2018.04.018.

Joshi, Kshiti D., Saonee Sarker, and Suprateek Sarker. 2007. "Knowledge Transfer Within Information Systems Development Teams: Examining the Role of Knowledge Source Attributes." Decision Support Systems 43 (2): 322-335.

Kalpic, Brane. 2008. "Why Bigger Is Not Always Better: The Strategic Logic of Value Creation Through Mergers and Acquisitions." Journal of Business Strategy 29 (6): 4-13. Accessed January 22, 2018. https://Search-ProQuest-com.ProxyGw.wRlc.org/docview/202682959.

Khandelwal, Vijay K., and Petter Gottschalk. 2003. "Information Technology Support for Interorganizational Knowledge Transfer: An Empirical Study of Law Firms in Norway and Australia." Information Resources Management Journal 16 (1): 14-23.

Kim, Junghwan, Jaeki Song, and Donald R. Jones. 2011. "The Cognitive Selection Framework for Knowledge Acquisition Strategies in Virtual Communities." International Journal of Information Management 31 (2): 111-120.

Ko, Dong-Gil. 2010. "Consultant Competence Trust Does Not Pay Off, But Benevolent Trust Does! Managing Knowledge with Care." Journal of Knowledge Management 14 (2): 202-213.

Koman, Gabriel, and Jana Kundrikova. 2016. "Application of Big Data Technology in Knowledge Transfer Process Between Business and Academia." Procedia Economics and Finance, 605-611.

Kowalik, Grzegorz, and Radoslaw Nielek. 2016. "Senior Programmers: Characteristics of Elderly Users from Stack Overflow." International Conference on Social Informatics, 87-96. Cham: Springer.

Laudon, Kenneth C., and Jane P. Laudon. 2006. "Managing Knowledge in the Digital Firm." In Management Information Systems: Managing the Digital Firm, 563-618. Pearson.

Masudur-Rahman, Mohammad, and Chanchal K. Roy. 2018. "An Insight into the Unresolved Questions at Stack Overflow." arXiv.

Matei, Sorin Adam, Amani Abu Jabal, and Elisa Bertino. 2018. "Social-Collaborative Determinants of Content Quality in Online Knowledge Production Systems: Comparing Wikipedia and Stack Overflow." Social Network Analysis and Mining 8 (1): 36.

Murase, Kimiya, and Kunihito Kato. 2018. "Near-IR Material Discrimination Method by Using Multidimensional Response Variables PLS Regression Analysis." In 2018 International Workshop on Advanced Image Technology (IWAIT), 1-4. IEEE.

National Science Foundation. 2017. "Chapter 8. Invention, Knowledge Transfer, and Innovation." Accessed November 22, 2018. https://www.Nsf.gov/statistics/2018/nsb20181/report/sections/invention-knowledge-transfer-and-innovation/knowledge-transfer.

Nowlan, Michael F., and M. Brian Blake. 2007. "Agent-Mediated Knowledge Sharing for Intelligent Services Management." Information Systems Frontiers 9 (4): 411-421. doi:10.1007/s10796-007-9043-6.

Oliveira, Nigini, Michael Muller, Nazareno Andrade, and Katharina Reinecke. 2018. "The Exchange in Stack Exchange: Divergences Between Stack Overflow and its Culturally Diverse Participants." Proceedings of the ACM on Human-Computer Interaction, 1-22. https://Doi.org/10.1145/3274399.

Quick MBA. 2010. "Open Innovation - Porter's Generic Strategies." Accessed December 10, 2018. http://www.QuickMba.com/entre/open-innovation/.

Ragkhitwetsagul, Chaiyong, Jens Krinke, and Rocco Oliveto. 2017. "Awareness and Experience of Developers to Outdated and License-Violating Code on Stack Overflow: An Online Survey." arXiv preprint arXiv:1806.08149v1 [cs.SE]. UCL Computer Science Research Notes.

Raytheon Professional Services LLC. 2012. "Onboarding and Knowledge Transfer." Training Industry, Incorporated. https://TrainingIndustry.com/content/uploads/2017/07/onboarding-and-knowledge-transfer-report.pdf.

Roberts, Joanne. 2000. "From Know-How to Show-How? Questioning the Role of Information Communication Technologies in Knowledge Transfer." Technology Analysis and Strategic Management 12 (4): 429-433. doi:10.1080/713698499.

Rodríguez-Pérez, Raquel, Luis Fernández, and Santiago Marco. 2018. "Overoptimism in Cross-Validation When Using Partial Least Squares-Discriminant Analysis for Omics Data: A Systematic Study." Analytical and Bioanalytical Chemistry 410 (23): 5981-5992.

Rohrbach, Marcus, Michael Stark, and Bernt Schiele. 2011. "Evaluating Knowledge Transfer and Zero-Shot Learning in a Large-Scale Setting." In Computer Vision and Pattern Recognition (CVPR) - IEEE Conference, 1641-1648.

Roy, Kunal, Supratik Kar, and Rudra Narayan Das. 2015. "Selected Statistical Methods in QSAR." Chap. 6 in Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment. Academic Press.

Santhanam, Radhika, Larry Seligman, and David Kang. 2007. "Post-Implementation Knowledge Transfers to Users and Information Technology Professionals." Edited by Gatton College of Business and Economics, University of Kentucky School of Management. Journal of Management Information Systems 24 (1): 171-199. https://www.Jstor.org/stable/40398886.

Sarker, Saonee, Suprateek Sarker, Darren B. Nicholson, and Kshiti D. Joshi. 2005. "Knowledge Transfer in Virtual Systems Development Teams: An Exploratory Study of Four Key Enablers." IEEE Transactions on Professional Communication 48 (2): 201-218. doi:10.1109/TPC.2005.849650.

Scandura, Terri A. 1992. "Mentoring and Career Mobility: An Empirical Investigation." Journal of Organizational Behavior 13 (2): 169-174.

Schaufeld, Jerry. 2015. In Commercializing Innovation: Turning Technology Breakthroughs Into Products, 166. New York: Apress.

Shahbandi, Saeed Gholami, Martin Magnusson, and Karl Iagnemma. 2018. "Nonlinear Optimization of Multimodal Two-Dimensional Map Alignment with Application to Prior Knowledge Transfer." IEEE Robotics and Automation Letters 3 (3): 2040-2047.

Simon, Herbert Alexander. 1971. "Designing Organizations for an Information-Rich World." In Computers, Communication, and the Public Interest, edited by Martin Greenberger. Baltimore: The Johns Hopkins Press.

Simon, Steven J., Varun Grover, James T. C. Teng, and Kathleen Whitcomb. 1996. "The Relationship of Information System Training Methods and Cognitive Ability to End-User Satisfaction, Comprehension, and Skill Transfer: A Longitudinal Field Study." Information Systems Research 7 (4): 466-490.

Slaughter, Sandra A., and Laurie J. Kirsch. 2006. "The Effectiveness of Knowledge Transfer Portfolios in Software Process Improvement: A Field Study." Information Systems Research 17 (3): 301-320. https://Doi.org/10.1287/isre.1060.0098.

Starcevich, Matt, and Fred Friend. 1999. "Effective Mentoring Relationships from the Mentee's Perspective." Workforce 2-3.

Sun, Hao, Zhe Zhang, Babatunde Shittu Olasege, Zhong Xu, Qingbo Zhao, Peipei Ma, Qishan Wang, and Yuchun Pan. 2019. "Application of Partial Least Squares in Exploring the Genome Selection Signatures Between Populations." Heredity 122 (3): 288-293.

Swap, Walter, Dorothy Leonard, Mimi Shields, and Lisa Abrams. 2001. "Using Mentoring and Storytelling to Transfer Knowledge in the Workplace." Journal of Management Information Systems 18 (1): 95-114. doi:10.1080/07421222.2001.11045668.

Szulanski, Gabriel, Dimo Ringov, and Robert J. Jensen. 2016. "Overcoming Stickiness: How the Timing of Knowledge Transfer Methods Affects Transfer Difficulty." Organization Science 27 (2): 304-322. https://Doi.org/10.1287/orsc.2016.1049.

Vasilescu, Bogdan, Vladimir Filkov, and Alexander Serebrenik. 2013. "Stack Overflow and GitHub: Associations Between Software Development and Crowdsourced Knowledge." Social Computing (SocialCom) - IEEE, 188-195.

Wang, Bo, Ning Zhang, Quan Lin, Songcan Chen, and Yuhua Li. 2010. "Semantic-Oriented Knowledge Transfer for Review Rating." Tsinghua Science and Technology 15 (6): 633-641.

Yang, Di, Aftab Hussain, and Cristina Videira Lopes. 2016. "From Query to Usable Code: An Analysis of Stack Overflow Code Snippets." ACM, 391-402. doi:10.1145/2901739.2901767.

Yin, Cao, Ye Yuntao, and Zhao Hongli. 2018. "Satellite Hyperspectral Retrieval of Turbidity for Water Source Based on Discrete Particle Swarm and Partial Least Squares." Transactions of the Chinese Society for Agricultural Machinery 49 (1): 173-182.

Zhang, Jingxuan, He Jiang, Zhilei Ren, and Xin Chen. 2018. "Recommending APIs for API Related Questions in Stack Overflow." IEEE Access 6: 6205-6219.

Zhang, Michael J. 2007. "An Empirical Assessment of the Performance Impacts of Information Systems Support for Knowledge Transfer." Edited by Murray E. Jennex. International Journal of Knowledge Management (Idea Group Publishing) 3 (1): 66-85.

Zhang, Xinmin, Manabu Kano, Masahiro Tani, Junichi Mori, Junji Ise, and Kohhei Harada. 2018. "Hurdle Modeling for Defect Data with Excess Zeros in Steel Manufacturing Process." IFAC-PapersOnLine 51 (18): 375-380.

Zhang, Yun, David Lo, Xin Xia, and Jian-Ling Sun. 2015. "Multi-Factor Duplicate Question Detection in Stack Overflow." Journal of Computer Science and Technology 30 (5): 981-997. doi:10.1007/s11390-015-1576-4.


Appendix A

Table A-1. SEDE’s sample data for the First Hypothesis

Row | Answer_Score (DV) | Question_Score (IV) | Question_Number_Of_Views (IV) | Answer_Text_Length (IV) | Author_Views (IV) | Comment_Sum (IV) | Suggested_Edits_Count (IV)
1 | 3,264 | 2,421 | 360,255 | 1,593 | 21,365 | 870 | 114
2 | 1,127 | 792 | 218,788 | 11,362 | 3,645 | 435 | 115
3 | 1,123 | 651 | 184,312 | 783 | 132 | 360 | 75
4 | 863 | 479 | 229,164 | 531 | 406 | 360 | 116
5 | 844 | 241 | 65,835 | 446 | 30 | 75 | 18
… | … | … | … | … | … | … | …
1,481 | -4 | 17 | 16,319 | 429 | 256 | 14 | 4
1,482 | -5 | 24 | 2,089 | 797 | 132 | 3 | 3
1,483 | -7 | 11 | 16,199 | 159 | 32 | 24 | 12
1,484 | -7 | 2 | 600 | 2,172 | 71 | 4 | 11
1,485 | -8 | -3 | 127 | 519 | 83 | 21 | 15

Note. DV means dependent variable (Y-axis). IV means independent variable (X-axis containing confirmed predictors). The table displays the first five and the last five records collected. All data is shown before transformation.
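For readers who wish to reproduce this kind of analysis, the following minimal Python sketch fits a regression to the Table A-1 variables. The file name is an assumption, and ordinary least squares is shown purely as an illustration, not as the exact estimation procedure used in this praxis.

# Minimal sketch, assuming Table A-1 has been exported to a CSV file
# named "hypothesis1.csv" with the column names shown above.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("hypothesis1.csv")
X = sm.add_constant(data[["Question_Score", "Question_Number_Of_Views",
                          "Answer_Text_Length", "Author_Views",
                          "Comment_Sum", "Suggested_Edits_Count"]])
y = data["Answer_Score"]

model = sm.OLS(y, X).fit()  # ordinary least squares, for illustration
print(model.summary())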


Table A-2. SEDE’s sample data for the Second Hypothesis

Row | Answer_Accepted (DV) | Answer_Score (IV) | Bags_Of_Words_Score (IV) | Tag_Sum (IV) | Comment_Sum (IV) | Suggested_Edits_Count (IV)
1 | 1 | 3,264 | 24 | 107,427,957 | 870 | 114
2 | 1 | 1,127 | 17 | 38,454,827 | 435 | 115
3 | 1 | 1,123 | 43 | 5,668,860 | 360 | 75
4 | 1 | 863 | 22 | 93,960 | 360 | 116
5 | 1 | 844 | 7 | 7,718,232 | 75 | 18
… | … | … | … | … | … | …
1,481 | 1 | -4 | 63 | 4,038,158 | 14 | 4
1,482 | 1 | -5 | 32 | 1,275,204 | 3 | 3
1,483 | 0 | -7 | 18 | 5,841,447 | 24 | 12
1,484 | 1 | -7 | 106 | 3,401,541 | 4 | 11
1,485 | 1 | -8 | 22 | 9,834,340 | 21 | 15

Note. DV means dependent variable (Y-axis). IV means independent variable (X-axis containing confirmed predictors). The table displays the first five and the last five records collected. All data is shown before transformation.
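Because the dependent variable in Table A-2 (Answer_Accepted) is binary, a logistic model is the natural analogue of the sketch above. Again, the file name is an assumption and the model is illustrative only.

# Minimal sketch, assuming Table A-2 has been exported to a CSV file
# named "hypothesis2.csv" with the column names shown above.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("hypothesis2.csv")
X = sm.add_constant(data[["Answer_Score", "Bags_Of_Words_Score",
                          "Tag_Sum", "Comment_Sum",
                          "Suggested_Edits_Count"]])
y = data["Answer_Accepted"]

model = sm.Logit(y, X).fit()  # logistic regression, for illustration
print(model.summary())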


Table A-3. List of SEDE terms

Table Name | Column | Data Type (Precision) | Description

Badges116 | Id | INT (10) | PK.
Badges | UserId | INT (10) | FK to Users.
Badges | Name | NVARCHAR (50) | Name of the badge.
Badges | Date | DATETIME | yyyy-MM-dd hh:mm:ss.fff.117
Badges | Class | TINYINT (3) | Gold (1), silver (2), or bronze (3).
Badges | TagBased | BIT | True for tagged badges; false for named badges.

116 See Appendix B, Table B-1, Queries #2 and #3.

117 All timestamps use UTC (Coordinated Universal Time or Greenwich Mean Time). To make a conversion to EST (Eastern Standard Time), use Query #4. To list all time zones, use Query #5.

CloseAsOffTopicReasonTypes | Id | SMALLINT (5) | PK.
CloseAsOffTopicReasonTypes | IsUniversal | BIT | -
CloseAsOffTopicReasonTypes | MarkdownMini | NVARCHAR (500) | Close reason's markdown.
CloseAsOffTopicReasonTypes | CreationDate | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | CreationModeratorId (optional) | INT (10) | FK to Users.
CloseAsOffTopicReasonTypes | ApprovalDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | ApprovalModeratorId (optional) | INT (10) | FK to Users.
CloseAsOffTopicReasonTypes | DeactivationDate (optional) | DATETIME | yyyy-MM-dd hh:mm:ss.fff.
CloseAsOffTopicReasonTypes | DeactivationModeratorId (optional) | INT (10) | FK to Users.

CloseReasonTypes | Id | TINYINT (3) | PK.
CloseReasonTypes | Name | NVARCHAR (200) | Current: 101 = Duplicate; 102 = Off-topic; 103 = Unclear what you are asking; 104 = Too broad; 105 = Primarily opinion-based. Deprecated: 1 = Exact duplicate; 2 = Off-topic; 3 = Not constructive, subjective, or argumentative; 4 = Not a real question; 7 = Too localized; 10 = General reference; 20 = Noise or pointless.
CloseReasonTypes | Description (optional) | NVARCHAR (500) | Detailed description.

Comments | Id | INT (10) | PK.
Comments | PostId | INT (10) | FK to Posts.
Comments | Score | INT (10) | Relevance index. There is no voting feature for comments; hence, they have no impact on a user's reputation.
Comments | Text | NVARCHAR (600) | Comment body. Only users having a reputational score of 50 or higher can enter comments.
Comments | CreationDate | DATETIME (3) | When the comment was created.
Comments | UserDisplayName (optional) | NVARCHAR (30) | Author's display name or anonymous.
Comments | UserId (optional) | INT (10) | FK to Users. The user account might not exist.

FlagTypes | Id | TINYINT (3) | PK.
FlagTypes | Name | NVARCHAR (50) | Question recommended close (13), question close (14), and question reopen (15).
FlagTypes | Description | NVARCHAR (500) | A user without close privileges suggests a question should be closed. A user with close privileges is voting to reopen a question. A user with close privileges is voting to close a question.

PendingFlags | Id | INT (10) | PK.
PendingFlags | FlagTypeId | TINYINT (3) | FK to FlagTypes.
PendingFlags | PostId | INT (10) | FK to Posts.
PendingFlags | CreationDate (optional) | DATE | Record creation date.
PendingFlags | CloseReasonTypeId (optional) | TINYINT (3) | FK to CloseReasonTypes.
PendingFlags | CloseAsOffTopicReasonTypeId (optional) | SMALLINT (5) | FK to CloseAsOffTopicReasonTypes, but only when the close reason is off-topic.
PendingFlags | DuplicateOfQuestionId (optional) | INT (10) | FK to Posts for old or current duplicates.
PendingFlags | BelongsOnBaseHostAddress (optional) | NVARCHAR (100) | Votes to close and migrate.

PostFeedback | Id | INT (10) | PK. This table stores positive and negative votes from non-registered users.
PostFeedback | PostId | INT (10) | FK to Posts.
PostFeedback | IsAnonymous (optional) | BIT | Anonymous or unregistered users with no reputation.
PostFeedback | VoteTypeId | TINYINT (3) | FK to VoteTypes.
PostFeedback | CreationDate | DATETIME (3) | Record creation date.

PostHistory | Id | INT (10) | PK.
PostHistory | PostHistoryTypeId | TINYINT (3) | FK to PostHistoryTypes.
PostHistory | PostId | INT (10) | FK to Posts.
PostHistory | RevisionGUID | UNIQUEIDENTIFIER | Groups multiple historical records that occurred in a single action.
PostHistory | CreationDate | DATETIME (3) | Record creation date.
PostHistory | UserId (optional) | INT (10) | FK to Users.
PostHistory | UserDisplayName (optional) | NVARCHAR (40) | Only visible when the User ID is not available.
PostHistory | Comment (optional) | NVARCHAR (400) | Comments entered by users editing the post. When the value equals 10, the close reason will be visible. When the value equals 33 or 34, the Post Notice ID will be visible.
PostHistory | Text (optional) | NVARCHAR (-1) | Revision's content: JSON string with all users who voted when Post History Type ID equals 10, 11, 12, 13, 14, 15, 19, 20, or 35; JSON string with all Original Question IDs if it is a duplicate close vote; and, if the ID equals 17, it will contain migration metadata.

PostHistoryTypes | Id | TINYINT (3) | PK: 1 = Initial title (questions only); 2 = Initial body; 3 = Initial tags (questions only); 4 = Edit title (questions only); 5 = Edit body (raw markdown); 6 = Edit tags (questions only); 7 = Rollback title (questions only); 8 = Rollback body (raw markdown); 9 = Rollback tags (questions only); 10 = Post closed; 11 = Post reopened; 12 = Post deleted; 13 = Post undeleted or restored; 14 = Post locked by moderator; 15 = Post unlocked by moderator; 16 = Community-owned; 17 = Post migrated (replaced by codes 35 and 36, which stand for away and here); 18 = Question merged with deleted question; 19 = Question protected by moderator; 20 = Question unprotected by moderator; 21 = Post disassociated by administrator; 22 = Question unmerged (metadata was restored to a previously merged question); 24 = Suggested edit applied; 25 = Post tweeted; 31 = Comment discussion moved to chat; 33 = Post notice added (FK to PostNotices); 34 = Post notice removed; 35 = Post migrated away (refer to code 17); 36 = Post migrated here (refer to code 17); 37 = Post merge source; 38 = Post merge destination; 50 = Community bump. Deprecated: 23 = Unknown development-related event; 26 = Vote nullification by developer; 27 = Post unmigrated or hidden by a moderator; 28 = Unknown suggestion event; 29 = Unknown moderator event (possibly due to dewikification); 30 = Unknown event.
PostHistoryTypes | Name | NVARCHAR (50) | Name associated with the ID, as outlined above.

PostLinks | Id | INT (10) | PK.
PostLinks | CreationDate | DATETIME (3) | Record creation date.
PostLinks | PostId | INT (10) | FK to Posts (original or source).
PostLinks | RelatedPostId | INT (10) | FK to Posts (related or target).
PostLinks | LinkTypeId | TINYINT (3) | Type of link (in reference to Related Post ID): 1 = Linked; 3 = Duplicate.

PostNotices | Id | INT (10) | PK.
PostNotices | PostId | INT (10) | FK to Posts.
PostNotices | PostNoticeTypeId (optional) | INT (10) | FK to PostNoticeTypes.
PostNotices | CreationDate | DATETIME (3) | Record creation date.
PostNotices | DeletionDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
PostNotices | ExpiryDate (optional) | DATETIME (3) | yyyy-MM-dd hh:mm:ss.fff.
PostNotices | Body (optional) | NVARCHAR (-1) | Custom text that accompanies the notice.
PostNotices | OwnerUserId (optional) | INT (10) | FK to Users.
PostNotices | DeletionUserId (optional) | INT (10) | FK to Users.

PostNoticeTypes | Id | INT (10) | PK.
PostNoticeTypes | ClassId | TINYINT (3) | 1 = Historical lock; 2 = Bounty; 4 = Moderator notice.
PostNoticeTypes | Name (optional) | NVARCHAR (80) | Name of the notice type.
PostNoticeTypes | Body (optional) | NVARCHAR (-1) | Default notice content.
PostNoticeTypes | IsHidden | BIT | Whether the notice type is visible or not.
PostNoticeTypes | Predefined | BIT | Whether the notice type is a constant or not.
PostNoticeTypes | PostNoticeDurationId | INT (10) | -1 = No duration specified; 1 = Seven days for bounties.

Posts | Id | INT (10) | PK. This table stores all active, non-deleted posts.
Posts | PostTypeId | TINYINT (3) | FK to PostTypes.
Posts | AcceptedAnswerId (optional) | INT (10) | Only visible when the Post Type ID equals one (1).
Posts | ParentId (optional) | INT (10) | Only visible when the Post Type ID equals two (2).
Posts | CreationDate | DATETIME (3) | Record creation date.
Posts | DeletionDate (optional) | DATETIME (3) | Available only when there is a PostsWithDeleted record.
Posts | Score | INT (10) | Relevance score.
Posts | ViewCount (optional) | INT (10) | The number of times the post has been viewed.
Posts | Body (optional) | NVARCHAR (-1) | Displayed as rendered (not markdown) HTML.
Posts | OwnerUserId (optional) | INT (10) | FK to Users. Only visible when the user is active. Wiki entries are owned by the community and have a value of -1.
Posts | OwnerDisplayName (optional) | NVARCHAR (40) | User's friendly name.
Posts | LastEditorUserId (optional) | INT (10) | Last user who edited the post.
Posts | LastEditorDisplayName (optional) | NVARCHAR (40) | User's friendly name.
Posts | LastEditDate (optional) | DATETIME (3) | Most recent edit date/time.
Posts | LastActivityDate (optional) | DATETIME (3) | Most recent activity date/time.
Posts | Title (optional) | NVARCHAR (250) | Post's title.
Posts | Tags (optional) | NVARCHAR (250) | Tags or keywords used. Each post can have a maximum of five tags, considering it might be related to several subjects.
Posts | AnswerCount (optional) | INT (10) | Number of answers entered.
Posts | CommentCount (optional) | INT (10) | Number of comments entered.
Posts | FavoriteCount (optional) | INT (10) | The number of times the post has been favorited.
Posts | ClosedDate (optional) | DATETIME (3) | Visible when the post is closed.
Posts | CommunityOwnedDate (optional) | DATETIME (3) | Visible when the post belongs to a community wiki.

PostsWithDeleted | Id | INT (10) | PK. This table's schema duplicates Posts and stores all deleted posts.
PostsWithDeleted | PostTypeId | TINYINT (3) | FK to PostTypes.
PostsWithDeleted | AcceptedAnswerId (optional) | INT (10) | Not populated.
PostsWithDeleted | ParentId (optional) | INT (10) | Only visible when the Post Type ID equals 2.
PostsWithDeleted | CreationDate | DATETIME (3) | Record creation date.
PostsWithDeleted | DeletionDate (optional) | DATETIME (3) | Available only when there is a PostsWithDeleted record.
PostsWithDeleted | Score | INT (10) | Relevance score.
PostsWithDeleted | ViewCount (optional) | INT (10) | Not populated.
PostsWithDeleted | Body (optional) | NVARCHAR (-1) | Not populated.
PostsWithDeleted | OwnerUserId (optional) | INT (10) | Not populated.
PostsWithDeleted | OwnerDisplayName (optional) | NVARCHAR (40) | Not populated.
PostsWithDeleted | LastEditorUserId (optional) | INT (10) | Not populated.
PostsWithDeleted | LastEditorDisplayName (optional) | NVARCHAR (40) | Not populated.
PostsWithDeleted | LastEditDate (optional) | DATETIME (3) | Not populated.
PostsWithDeleted | LastActivityDate (optional) | DATETIME (3) | Not populated.
PostsWithDeleted | Title (optional) | NVARCHAR (250) | Not populated.
PostsWithDeleted | Tags (optional) | NVARCHAR (250) | Tags or keywords used.
PostsWithDeleted | AnswerCount (optional) | INT (10) | Not populated.
PostsWithDeleted | CommentCount (optional) | INT (10) | Not populated.
PostsWithDeleted | FavoriteCount (optional) | INT (10) | Not populated.
PostsWithDeleted | ClosedDate (optional) | DATETIME (3) | Visible when the post is closed.
PostsWithDeleted | CommunityOwnedDate (optional) | DATETIME (3) | Not populated.

PostTags | PostId | INT (10) | FK to Posts.
PostTags | TagId | INT (10) | FK to Tags.

PostTypes | Id | TINYINT (3) | PK: 1 = Question; 2 = Answer; 3 = Orphaned tag wiki; 4 = Tag wiki excerpt; 5 = Tag wiki; 6 = Moderator nomination; 7 = Wiki placeholder (or election description); 8 = Privileged wiki.
PostTypes | Name | NVARCHAR (50) | Name of the post type.

ReviewRejectionReasons | Id | TINYINT (3) | PK: 1 = Radical change; 2 = Vandalism; 3 = Too minor; 4 = Invalid edit; 5 = Copied content; 6 = Wiki not helpful; 7 = Excerpt not helpful; 8 = Suggested edit conflict; 9 = Critical issues; 101 = Spam or vandalism; 102 = No improvement whatsoever; 103 = Irrelevant tag; 104 = Clearly conflicts with author's intent; 105 = Attempt to reply; 106 = Copied content; 107 = Lacks usage guidance; 108 = Suggested edit conflict; 109 = Critical issues; 110 = Circular tag definition.
ReviewRejectionReasons | Name | NVARCHAR (100) | Review rejection reason's name.
ReviewRejectionReasons | Description | NVARCHAR (300) | A detailed description of the review rejection reason.
ReviewRejectionReasons | PostTypeId (optional) | TINYINT (3) | FK to PostTypes, only when the Post Types ID equals Wiki not helpful (6) or Excerpt not helpful (7).

ReviewTaskResults | Id | INT (10) | PK.
ReviewTaskResults | ReviewTaskId | INT (10) | FK to ReviewTasks.
ReviewTaskResults | ReviewTaskResultTypeId | TINYINT (3) | FK to ReviewTaskResultTypes.
ReviewTaskResults | CreationDate (optional) | DATE (0) | Record creation date.
ReviewTaskResults | RejectionReasonId (optional) | TINYINT (3) | FK to ReviewRejectionReasons for suggested edits.
ReviewTaskResults | Comment (optional) | NVARCHAR (150) | Any clarifying comments.

ReviewTaskResultTypes | Id | TINYINT (3) | PK: 1 = Not sure; 2 = Approve for suggested edits; 3 = Reject for suggested edits; 4 = Delete for low-quality posts; 5 = Edit for first-time (low-quality posts and late answers); 6 = Close for low-quality posts; 7 = Looks good for high-quality posts; 8 = Do not close; 9 = Recommend deletion for low-quality answers; 10 = Recommend close for low-quality questions; 11 = I am done (related to first posts); 12 = Reopen; 13 = Leave closed; 14 = Edit and reopen; 15 = Excellent by community evaluation; 16 = Satisfactory by community evaluation; 17 = Needs improvement by community evaluation; 18 = No action needed (related to first posts and late answers); 19 = Reject and edit; 20 = Should be improved; 21 = Unsalvageable.
ReviewTaskResultTypes | Name | NVARCHAR (100) | Name of the review task result type.
ReviewTaskResultTypes | Description | NVARCHAR (300) | A detailed description of the review task result type.

ReviewTasks | Id | INT (10) | PK.
ReviewTasks | ReviewTaskTypeId | TINYINT (3) | FK to ReviewTaskTypes.
ReviewTasks | CreationDate (optional) | DATE (0) | Record creation date.
ReviewTasks | DeletionDate (optional) | DATE (0) | Record deletion date.
ReviewTasks | ReviewTaskStateId | TINYINT (3) | FK to ReviewTaskStates.
ReviewTasks | PostId | INT (10) | FK to Posts.
ReviewTasks | SuggestedEditId (optional) | INT (10) | FK to SuggestedEdits. It contains internal numbering for historicity.
ReviewTasks | CompletedByReviewTaskId (optional) | INT (10) | FK to ReviewTaskResults for the outcome of a completed review.

ReviewTaskStates | Id | TINYINT (3) | PK: 1 = Active (the review task is still in the queue); 2 = Completed (the review task was completed, so it is no longer in the queue); 3 = Invalidated (the task was dequeued naturally; a post might be edited to achieve higher quality, or close votes might expire, all of which is completely separate from the review dashboard).
ReviewTaskStates | Name | NVARCHAR (50) | Name of the review task state.
ReviewTaskStates | Description | NVARCHAR (300) | A detailed description of the review task state.

ReviewTaskTypes | Id | TINYINT (3) | PK: 1 = Suggested edit; 2 = Close votes; 3 = Low-quality posts; 4 = First post; 5 = Late answer; 6 = Reopen vote; 7 = Community evaluation; 10 = Triage; 11 = Helper. Deprecated: 8 = Link validation; 9 = Flagged posts.
ReviewTaskTypes | Name | NVARCHAR (50) | Name of the review task type.
ReviewTaskTypes | Description | NVARCHAR (300) | A detailed description of the review task type.

SuggestedEdits | Id | INT (10) | PK. This record is visible (under review) when both approval and rejection dates are blank and the ReviewTasks record holds an active state.
SuggestedEdits | PostId | INT (10) | FK to Posts.
SuggestedEdits | CreationDate (optional) | DATETIME (3) | Record creation date.
SuggestedEdits | ApprovalDate (optional) | DATETIME (3) | Record approval date; only visible for approved edits.
SuggestedEdits | RejectionDate (optional) | DATETIME (3) | Record rejection date; only visible for rejected edits.
SuggestedEdits | OwnerUserId (optional) | INT (10) | FK to Users.
SuggestedEdits | Comment (optional) | NVARCHAR (800) | Additional user comment.
SuggestedEdits | Text (optional) | NVARCHAR (-1) | Edit content.
SuggestedEdits | Title (optional) | NVARCHAR (250) | Edit title.
SuggestedEdits | Tags (optional) | NVARCHAR (250) | Edit tags.
SuggestedEdits | RevisionGUID (optional) | UNIQUEIDENTIFIER | Groups multiple edits.

SuggestedEditVotes | Id | INT (10) | PK.
SuggestedEditVotes | SuggestedEditId | INT (10) | FK to SuggestedEdits.
SuggestedEditVotes | UserId | INT (10) | FK to Users (the person who is making the suggestion).
SuggestedEditVotes | VoteTypeId | TINYINT (3) | FK to VoteTypes.
SuggestedEditVotes | CreationDate | DATETIME (3) | Record creation date.
SuggestedEditVotes | TargetUserId (optional) | INT (10) | FK to Users (the person who is the target of the suggestion).
SuggestedEditVotes | TargetRepChange | INT (10) | Targeted area to be edited.

Tags | Id | INT (10) | PK.
Tags | TagName (optional) | NVARCHAR (35) | Name of the tag.
Tags | Count | INT (10) | The number of times it has been listed.
Tags | ExcerptPostId (optional) | INT (10) | FK to Posts.
Tags | WikiPostId (optional) | INT (10) | FK to Posts.

TagSynonyms | Id | INT (10) | PK.
TagSynonyms | SourceTagName (optional) | NVARCHAR (35) | FK to Tags (origin).
TagSynonyms | TargetTagName (optional) | NVARCHAR (35) | FK to Tags (destination).
TagSynonyms | CreationDate | DATETIME (3) | Record creation date.
TagSynonyms | OwnerUserId | INT (10) | FK to Users.
TagSynonyms | AutoRenameCount | INT (10) | The number of times the tag has been autocorrected.
TagSynonyms | LastAutoRename (optional) | DATETIME (3) | Last time the tag was autocorrected.
TagSynonyms | Score | INT (10) | Relevance score.
TagSynonyms | ApprovedByUserId (optional) | INT (10) | FK to Users.
TagSynonyms | ApprovalDate (optional) | DATETIME (3) | Record approval date.

Users | Id | INT (10) | PK.
Users | Reputation | INT (10) | User reliability or estimated measure of community trust toward a user. A maximum of 200 points can be gained per diem: +5 for a question upvote; +10 for an answer upvote; +15 for an accepted/chosen answer. Higher reputational scores will unlock privileges and enable access to additional features. Stack Overflow values the community's feedback over the asker's feedback.
Users | CreationDate | DATETIME (3) | Record creation date.
Users | DisplayName (optional) | NVARCHAR (40) | User-friendly name.
Users | LastAccessDate | DATETIME (3) | Last date/time when a user loaded or viewed a page. At best, this is updated every thirty minutes.
Users | WebsiteUrl (optional) | NVARCHAR (200) | User's personal website.
Users | Location (optional) | NVARCHAR (100) | City or region where the user claims to reside.
Users | AboutMe (optional) | NVARCHAR (-1) | Biographical information or role description.
Users | Views | INT (10) | The number of times the profile has been viewed (popularity).
Users | UpVotes | INT (10) | The number of positive votes received (favorable popularity).
Users | DownVotes | INT (10) | The number of negative votes received (detrimental popularity).
Users | ProfileImageUrl (optional) | NVARCHAR (200) | Link to the user's picture, avatar, or logo.
Users | EmailHash (optional) | VARCHAR (32) | Deprecated field.
Users | AccountId (optional) | INT (10) | Stack Exchange Network's Profile ID.

Votes | Id | INT (10) | PK.
Votes | PostId | INT (10) | FK to Posts.
Votes | VoteTypeId | TINYINT (3) | FK to VoteTypes.
Votes | UserId (optional) | INT (10) | FK to Users. Only visible when the Vote Type ID equals 5 (favorited) or 8 (bountied). The Vote Type ID will equal -1 for deleted users.
Votes | CreationDate (optional) | DATETIME (3) | The record creation date is hidden to protect the user's privacy.
Votes | BountyAmount (optional) | INT (10) | Only visible when the Vote Type ID equals 8 or 9; i.e., the bounty was started or closed.

VoteTypes | Id | TINYINT (3) | PK: 1 = Accepted by originator; 2 = UpMod (approve/positive/upvote); 3 = DownMod (reject/negative/downvote); 4 = Offensive; 5 = Favorite (includes the User ID); 6 = Close (creates a PostHistory record as of June 2013); 7 = Reopen; 8 = Bounty start (populates User ID and bounty amount); 9 = Bounty close (populates a bounty amount); 10 = Deletion; 11 = Undeletion; 12 = Spam; 15 = Moderator review (the moderator has viewed the post after the post was flagged for moderator's attention); 16 = Approve edit suggestion (a user voted to approve a suggested edit; to be approved, most posts require several user votes or a single, binding moderator vote).
VoteTypes | Name | NVARCHAR (50) | Name of the vote type.

Appendix B

Table B-1. Query list118

Query #1.

SELECT CAST(AVG([Score]) AS DECIMAL(10, 2)) FROM [Posts];

Query #2.

SELECT
    c.[Table_Name],
    CASE
        WHEN ##PK:STRING?id## = c.[Column_Name]
            THEN CONCAT(c.[Column_Name], ' (PK)')
        ELSE c.[Column_Name]
    END AS Column_Name,
    [Data_Type],
    [Is_Nullable],
    COALESCE(CHARACTER_MAXIMUM_LENGTH, Numeric_Precision, DateTime_Precision)
        AS [Length / Precision]
FROM [Information_Schema].[Columns] c
WHERE c.[Table_Name] = ##TABLE:STRING?Posts##
ORDER BY [Ordinal_Position] ASC;

118 All queries listed here were created by the Stack Exchange community and are publicly available. Source: https://Data.StackExchange.com/stackoverflow/queries.


Query #3.

SELECT
    'query://874190/?table=' + [Table_Name] + '|' + [Table_Name]
        AS [Table (Links to Sample Data)], -- Sample data subquery.
    [Ordinal_Position] AS [Field Number],
    [Column_Name] AS [Column],
    CASE
        WHEN [Data_Type] LIKE '%VARCHAR%'
            THEN [Data_Type] + '(' + CAST(Character_Maximum_Length AS VARCHAR) + ')'
        WHEN [Data_Type] LIKE '%INT%'
            THEN [Data_Type] + '(' + CAST(Numeric_Precision AS VARCHAR) + ')'
        ELSE [Data_Type]
    END AS [Type],
    [Table_Schema],            -- dbo.
    [Column_Default],          -- Nullable.
    [DateTime_Precision],      -- It has a value of three for the DateTime data type.
    [Numeric_Precision_Radix], -- Only visible for TinyInt, SmallInt, and Int data types.
    [Numeric_Scale]            -- Nullable.
FROM [Information_Schema].[Columns]
--WHERE
--    [Table_Name] LIKE '%Types%' -- Only applicable to static referential tables.
ORDER BY [Table_Name] ASC, [Ordinal_Position] ASC;

Query #4.

SELECT TOP (1)
    [CreationDate] AS [CreationDate],
    DATEADD(HOUR, -4, [CreationDate]) AS [CreationDateEST],
    GETDATE() AS [CurrentDate],
    DATEADD(HOUR, -4, GETDATE()) AS [CurrentDateEST]
FROM [Posts];

Query #5.

SELECT * FROM [Sys].[Time_Zone_Info];
