Mining DEV for Social and Technical Insights About Software Development
Total Page:16
File Type:pdf, Size:1020Kb
Mining DEV for social and technical insights about software development Maria Papoutsoglou∗y, Johannes Wachszx, Georgia M. Kapitsakiy ∗Aristotle University of Thessaloniki, Greece yUniversity of Cyprus, Cyprus zVienna University of Economics and Business, Austria xComplexity Science Hub Vienna, Austria [email protected]; [email protected]; [email protected] Abstract—Software developers are social creatures: they com- On more socially oriented platforms like Twitter, which limits municate, collaborate, and promote their work in a variety of post lengths to 280 characters, discussions about software mix channels. Twitter, GitHub, Stack Overflow, and other platforms with an endless variety of other content. offer developers opportunities to network and exchange ideas. Researchers analyze content on these sites to learn about trends To address this gap we present a novel source of long-form and topics in software engineering. However, insight mined text data created by people working in software called DEV from the text of Stack Overflow questions or GitHub issues (https://dev.to). DEV is “a community of software developers is highly focused on detailed and technical aspects of software getting together to help one another out,” focused especially development. In this paper, we present a relatively new online on facilitating cooperation and learning. Content on DEV community for software developers called DEV. On DEV users write long-form posts about their experiences, preferences, and resembles blog and Medium posts and, at a glance, covers working life in software, zooming out from specific issues and files everything from programming language choice to technical to reflect on broader topics. About 50,000 users have posted over specifications of systems to social factors in software. Other 140,000 articles related to software development. In this work, users can interact with posts by liking them, commenting on we describe the content of posts on DEV using a topic model, them, or sharing them on other platforms. Contributions to showing that developers discuss a rich variety and mixture of social and technical aspects of software development. We show DEV are longer than tweets: the nearly 140,000 texts in our that developers use DEV to promote themselves and their work: dataset contain on average 710 words (median 461, stdev. 990). 83% link their profiles to their GitHub profiles and 56% to Users can fill out an extensive profile, including links to their their Twitter profiles. 14% of users pin specific GitHub repos in other accounts on the web and to their personal projects on their profiles. We argue that DEV is emerging as an important GitHub. This suggests that DEV has significant potential for hub for software developers, and a valuable source of insight for researchers to complement data from platforms like GitHub and developers to promote themselves and to drive attention to Stack Overflow. their software projects. Index Terms—Developers, social networks, human factors, Text posted on DEV has a broader perspective than content topic modeling, GitHub, DEV that can be collected from GitHub or Stack Overflow, while remaining focused on software. Potential applications include I. INTRODUCTION predicting language and framework popularity, measuring the social attitudes of developers, and understanding the general Online communities and platforms play an important role in trends in software. As we will show, many DEV users share the professional lives of software developers [1]. Developers links to their profiles on multiple platforms including GitHub, find ideas, collaborators, and even jobs through connections on Twitter and Linkedin, highlighting the platform’s potential for software oriented platforms like GitHub and Stack Overflow, arXiv:2103.17054v2 [cs.SE] 16 May 2021 the study of cross-platform behavior [5]. About 85% of the and on traditional social media including Facebook and Twitter users in our dataset listing both a GitHub and Twitter account [2]. Their interactions and the content posted on these plat- are not in the state of the art dataset linking GitHub and forms are a valuable source of information about the practice Twitter from MSR 2020 [6]. A significant number of users link of software engineering. The software engineering research to platforms of emerging interest to the software engineering community has long been interested in predicting current community such as Youtube and Twitch. practice and future trends using digital trace data [3], [4]. The aim of our work is to give the empirical software Despite the richness of these data sources, researchers using engineering community a first look at DEV and to promote them in this way often make broad inferences about software its potential for research. We collect data from the platform engineering from highly contextualized content. GitHub issues and apply a topic modeling approach to map the topics of or commit messages generally refer to specific pieces of code. conversation in the community. We find that users discuss Stack Overflow questions also tend to be highly specialized. a rich mixture of social and technical aspects of software To appear in the Proceedings of the 18th International Conference on development, often in the same post. We also describe the Mining Software Repositories (MSR 2021). information users share about themselves, noting that many users link to their accounts on multiple other sites, and to stored metadata and author information for every article. For personal projects. We highlight the potential of this new each unique user we crawled the information available on their platform to complement data sources traditionally used by the public profile, including links to other platforms. Using the software engineering community to study social aspects of official DEV API (https://docs.forem.com/api/), we identified software and trends in development. more than 53,000 posting users, which we used to verify the completeness of our collection of articles. Roughly 33% of II. RELATED WORK users wrote 80% of articles in our corpus. As an article can Open source software developers use a variety of channels have more than one tag, we removed duplicate articles based to promote themselves and their projects [7]. As part of these on their URL and title. We extracted the full text of 147,030 channels social media plays an important role in software engi- articles. Using a language detection method we filtered out neering [8], offering communication (sharing content, meeting non-English articles leaving a corpus of 138,925 articles. new people), coordination (about events, releases, calls to Preprocessing articles: In the data collection step we action) and promotion (launches, new opportunities). Social retrieved the HTML format of the text; so we were able to find media and blogs also play an important role in the labor market specific HTML elements which we used to process the data. for job-seekers and recruiters alike, especially when combined In DEV posts, as in Stack Overflow questions and answer, with content from online coding platforms. Companies and code snippets are enclosed in the <code> </code> HTML projects expend considerable effort and resources to attract element. As we are interested for the natural language part the best developers, including examining the social web [9]. of the articles, we removed these code snippets, other HTML This online presence of developers can be utilized directly, for elements (e.g. <li> or <p>), and URLs. Using the Quanteda example matching GitHub profiles to job ads [10]. [21] R package we ran the texts through a standard prepro- Text analysis has been applied to many kinds of data cessing pipeline, removing punctuation, number symbols, and generated by software developers, including data coming from stopwords, and casting to lowercase. We stemmed the remain- the above sources. Though typically unstructured, text con- ing words in each text and created unigrams and bigrams. After tains important information about how people think and feel evaluating preliminary results from various topic models, we about software development. Common sources for text about discarded the unigrams as topics derived from bigrams were software include GitHub [11], Bugzilla [12], Stack Overflow found to have significantly richer meaning [22]. and mailing lists [13]. A common approach is to fit a topic Topic modeling setup: Before proceeding with the topic model to a corpus. A topic model is an unsupervised method modeling, we trimmed bigrams appearing in fewer than 90 which assigns topics to documents. This approach has been texts and less than 200 times in total to reduce noise in the applied to data from technical contexts including GitHub corpus. An important step before finalizing a topic modeling pull requests [14], Stack Overflow Q&A and other Stack approach is to determine the best K number of topics. A Exchange sites [15]–[17], and Gitter [18]. These sources focus popular metric to find the best fitted K value is the use on specific aspects of the software development process, and of Cv coherence score [16], [23]. We performed a range of the resulting corpora are oriented towards technical themes. different experiments by varying K from 5 to 50 in steps of But software engineering researchers are not only interested five recording each run’s coherence score. 30 topics had a in how developers talk about specific technical issues. Seeking high coherence score and was also evaluated by two authors to find data that better captures social