An Analysis of Python's Topics, Trends, and Technologies Through Mining Stack Overflow Discussions
Hamed Tahmooresi, Abbas Heydarnoori, and Alireza Aghamohammadi

H. Tahmooresi, A. Heydarnoori, and A. Aghamohammadi are with the Department of Computer Engineering, Sharif University of Technology, Iran.

arXiv:2004.06280v1 [cs.SE] 14 Apr 2020

Abstract: Python is a popular, widely used, general-purpose programming language. In spite of its ever-growing community, researchers have not performed much analysis of Python's topics, trends, and technologies, which would give developers insight into the trends and main issues of the Python community. In this article, we examine the main topics related to this language that developers discuss on one of the most popular Q&A websites, Stack Overflow, as well as their temporal trends, by mining 2 461 876 posts. To be more useful for software engineers, we also study what Python provides as alternatives to popular technologies offered by common programming languages like Java. Our results indicate that discussions about Python standard features, web programming, and scientific programming¹ are the most popular areas in the Python community. At the same time, areas related to scientific programming are steadily receiving more attention from Python developers.

¹ Programming in areas such as mathematics, data science, statistics, machine learning, natural language processing (NLP), and so forth.

Index Terms: Python, Q&A websites, Stack Overflow, Topic Modeling, Trend Analysis.

1 INTRODUCTION

Python is a widely used, high-level, general-purpose, interpreted, and dynamic programming language [1]. According to the Stack Overflow annual survey, which engaged more than 90 000 developers in 2019, Python edged out Java in the overall ranking at the end of 2018, much as it edged out C# in 2017 and PHP in 2016. Stack Overflow called Python the fastest-growing major programming language [2]. In spite of the prevalence of Python, researchers have not adequately analyzed the trends and technologies of this language in software developer communities and Question and Answer (Q&A) websites such as Stack Overflow, the de facto website actively used by developers. Analyzing this invaluable source of information can help developers gain insight into Python's trends and main issues.

Prior studies have leveraged a topic modeling approach called Latent Dirichlet Allocation (LDA) [3] to categorize discussions in specific areas such as web programming [4], mobile application development [5], security [6], and blockchain [7]. However, none of them has focused on Python. In this study, we shed light on Python's main areas of discussion by mining 2 461 876 posts, from August 2008 to January 2019, using LDA-based topic modeling. We also analyze the temporal trends of the extracted topics to find out how developers' interest is changing over time.

After explaining the topics and trends of Python, we investigate the technologies offered by Python, which are not properly revealed by the topic modeling. Because of the significant prevalence of Python, software engineers, mostly experts in other programming languages, may be eager to know which Python technologies serve as alternatives to the ones they already use, so that they can investigate this language further. Such an expert may search online resources and read a lot to find an appropriate answer, but aggregating and verifying the obtained knowledge is left to her. Therefore, by leveraging a word embedding model [8], we find, for each technology available in another programming language, the technologies that Python offers and that Python developers actually use in practice. Hence, we pave the way for a better understanding of Python's solutions.

The main findings of our study indicate that Python developers mostly discuss Python standard features, web programming, and scientific programming. However, the popularity of scientific programming is growing faster.

2 STUDY SETUP

The primary goal of our study is to extract the areas that Python developers discuss on Stack Overflow. For this purpose, we introduce three research questions:

RQ1: What are the discussion areas in the Python community?
RQ2: How is the interest of Python developers changing over the time?
RQ3: What technologies does Python provide?

Figure 1 illustrates the high-level steps we take to answer our research questions.

[Figure 1: flowchart starting from the Stack Overflow dataset (Aug 2008 to Jan 2019). Steps: (1) extract the tags associated with each language; (2) filter Python discussions; (3) prune low quality discussions; (4) preprocess discussion bodies; (5) extract topics using the LDA algorithm; (6) cluster topics; (7) partition the dataset; (8) analyze temporal trends; (9) extract a dataset containing just tags; (10) generate a word embedding model. Steps 1 to 6 address RQ1, Steps 7 and 8 address RQ2, and Steps 9 and 10 address RQ3.]

Fig. 1. Our high-level study methodology.

2.1 RQ1: What are the discussion areas in the Python community?

To address RQ1, we first extracted the tags associated with each language (Step 1). Similar to prior studies, we identified the discussions of each language according to Stack Overflow's tag mechanism [9], [10]. On this website, the owner of a question may assign up to five tags as keywords to the question, which denote the technological categories of that question [10]. We then extracted the Python posts (Step 2) and preprocessed them as the input to our topic modeling.

We pruned questions with a negative or zero score and without any accepted answer in order to improve the overall quality of the topic modeling (Step 3). According to the scoring mechanism, these questions are of poor quality [4]. In this way, 9% of all the Python posts were pruned from our dataset. Next, we cleaned the textual content of the extracted posts (Step 4) as follows:

1) All code snippets enclosed in the <code> tag were discarded, since source code would introduce noise in the results of the topic modeling [11].
2) All HTML tags were removed as well (e.g., <a href="...">, <b>, and so forth).
3) We tokenized the corpus and removed common English-language stop words such as a, the, and is, which do not help create meaningful topics.
4) All the tokens were stemmed using the Porter stemming algorithm [12].

Afterwards, we used the popular topic modeling technique LDA to investigate the developers' discussion areas (Step 5). LDA is an unsupervised model for statistical topic modeling that uses a bag-of-words approach [3]. Recently, LDA has been employed on software engineering communities such as Stack Overflow to extract the topics discussed by the crowd [4], [5]. It takes the number of topics K as an input. Larger values of K result in fine-grained topics, and lower values produce coarse-grained topics. Unfortunately, with a small K (e.g., 10 or 20), many discussion areas may remain hidden. As an example, an LDA model with 100 topics for Python discussions reveals topics such as game programming, IoT, and testing, whereas a model with 20 topics suppresses these areas. On the other hand, as the number of topics increases, their visualization and analysis become hard to understand. Besides, many duplicates emerge among the topics, and thus a manual merge is required [5]. Therefore, we first generated an LDA model with 100 topics and then asked two Python experts, who are not authors of this paper and have more than seven years of experience, to manually merge the topics of the LDA model.

The experts observed that topics extracted from Stack Overflow usually contain known technologies among their most probable words (e.g., XML, server, web framework names, and so forth). Therefore, if two topics share words with a high occurrence probability that match the names of known technologies, they may be good candidates for merging. For example, if two topics both have the word django (a Python web framework) as a highly probable word, we can anticipate that they are both about web programming. As a result, the experts used Stack Overflow tags as a source of words resembling technologies to help merge topics more accurately. Having examined the Stack Overflow tags along with the fine-grained topics, our experts decided to merge the topics into 12 clusters (Step 6). Note that the experts worked independently of each other and then shared their results. Where their results differed, they discussed them until all disagreements were resolved.

2.2 RQ2: How is the interest of Python developers changing over the time?

To address RQ2, we partitioned the dataset (obtained in Step 2) into three-month time intervals (Step 7). Three-month intervals yielded enough discussions in each interval to make our analysis more reliable. Since the Python community is constantly growing, we observed that the frequency of each cluster increases over time as well. Thus, to better analyze the data, we borrowed the concept of the impact of a topic presented by Barua et al. [10] to obtain the portion of a topic $t_k$ in a time interval $intv$:

\[
\mathrm{impact}(t_k, intv) = \frac{1}{|D(intv)|} \sum_{d_j \in D(intv)} \theta(d_j, t_k) \tag{1}
\]

where $D(intv)$ is the set of all posts in the time interval $intv$, and $\theta(d_j, t_k)$ denotes the probability of topic $t_k$ occurring in document $d_j$. To calculate the impact of a cluster in an interval, we simply sum the impacts of its topics.

assigned the tags <java, swing, jframe>, we simply consider "java swing jframe" as an item of our dataset. Finally, we trained our word embedding model using our dataset (Step 10).

3 RESULTS

Herein, we present the results and findings of our study.
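As an illustration of the pruning rule (Step 3) and the cleaning steps (Step 4) of Section 2.1, the following is a minimal sketch and not the authors' actual pipeline. The function names, the tiny stop-word list, and the regular expressions are assumptions made for this example. The stemmer is passed in as a parameter and defaults to the identity function, so the Porter stemming of item 4 is only represented by a hook here.

```python
import html
import re

# Assumption: a tiny illustrative stop-word list; a real run would use a full one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}

CODE_BLOCK = re.compile(r"<code>.*?</code>", re.DOTALL | re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")
TOKEN = re.compile(r"[a-z]+")


def keep_question(score, has_accepted_answer):
    """Step 3: prune only questions with score <= 0 AND no accepted answer."""
    return score > 0 or has_accepted_answer


def preprocess(body, stem=lambda token: token):
    """Step 4: strip code, strip HTML tags, tokenize, drop stop words, stem."""
    text = CODE_BLOCK.sub(" ", body)   # 1) discard <code>...</code> snippets
    text = HTML_TAG.sub(" ", text)     # 2) remove the remaining HTML tags
    text = html.unescape(text).lower()
    tokens = TOKEN.findall(text)       # 3) tokenize, then drop stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in tokens]   # 4) stem (identity unless a stemmer is given)


body = "<p>How is <b>the</b> Django view loaded?</p><pre><code>import django</code></pre>"
print(preprocess(body))
```

To follow the paper more closely, a Porter stemmer could be supplied through the `stem` argument, for example the `stem` method of NLTK's `PorterStemmer` if that library is available.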
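The impact measure of Eq. 1 and the per-cluster sum described in Section 2.2 can be sketched as follows. This is an illustrative sketch under stated assumptions: the input layout (pairs of a creation date and a topic-probability mapping), the function names, and the alignment of the three-month intervals to calendar quarters are all choices made for this example and are not specified in the text.

```python
from collections import defaultdict
from datetime import date


def quarter_of(day):
    """Map a date to a three-month interval, e.g. (2018, 4).

    Assumption: intervals are aligned to calendar quarters.
    """
    return (day.year, (day.month - 1) // 3 + 1)


def topic_impact(posts):
    """Return {interval: {topic: impact}} following Eq. 1.

    `posts` is an iterable of (creation_date, theta) pairs, where theta maps
    a topic id to the probability of that topic in the post (hypothetical layout).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for day, theta in posts:
        interval = quarter_of(day)
        counts[interval] += 1                  # |D(intv)|
        for topic, prob in theta.items():
            sums[interval][topic] += prob      # running sum of theta(d_j, t_k)
    return {
        interval: {t: s / counts[interval] for t, s in topics.items()}
        for interval, topics in sums.items()
    }


def cluster_impact(impact_by_topic, clusters):
    """Sum the topic impacts of each cluster; `clusters` maps name -> topic ids."""
    return {
        name: sum(impact_by_topic.get(t, 0.0) for t in topic_ids)
        for name, topic_ids in clusters.items()
    }


posts = [
    (date(2018, 10, 5), {0: 0.5, 1: 0.5}),
    (date(2018, 12, 20), {0: 0.25, 2: 0.75}),
    (date(2019, 1, 3), {1: 1.0}),
]
impact = topic_impact(posts)
print(cluster_impact(impact[(2018, 4)], {"web": [0, 1], "scientific": [2]}))
```

Dividing by the number of posts in the interval is what makes the measure a share rather than a raw count, so a cluster's curve does not rise merely because the community as a whole grows.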
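The construction of a tags-only dataset item, as in the "java swing jframe" example above, might look like the sketch below. It assumes that the tags of a question arrive as a single angle-bracket string such as `<java><swing><jframe>`; that input format and the helper name `tags_to_item` are assumptions made for this example. Training the word embedding model itself (Step 10) is not shown.

```python
import re

# Assumption: tags are stored as one string of <tag> entries.
TAG = re.compile(r"<([^<>]+)>")


def tags_to_item(raw_tags):
    """Turn '<java><swing><jframe>' into the dataset item 'java swing jframe'."""
    return " ".join(TAG.findall(raw_tags))


print(tags_to_item("<java><swing><jframe>"))
```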