UNIVERSITY COLLEGE LONDON

COMPUTER SCIENCE

A THESIS PRESENTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE

App Store Analysis for Software Engineering

JANUARY 25, 2017

Author:

WILLIAM MARTIN

Supervisors:

MARK HARMAN
YUE JIA
FEDERICA SARRO

Examiners:

RACHEL HARRISON
WALID MAALEJ

I, William Martin, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Abstract

App Store Analysis concerns the mining of data from apps, made possible through app stores. This thesis extracts publicly available data from app stores, in order to detect and analyse relationships between technical attributes, such as software features, and non-technical attributes, such as rating and popularity information. The thesis identifies the App Sampling Problem, its effects, and a methodology to ameliorate the problem. The App Sampling Problem is a fundamental sampling issue in mining app stores, caused by the rather limited 'most-popular-only' ranked app discovery present in mobile app stores. This thesis provides novel techniques for the analysis of technical and non-technical data from app stores. Topic modelling is used as a feature extraction technique, which is shown to produce the same results as n-gram feature extraction and which also enables technical features in app descriptions to be linked with those in user reviews. Causal impact analysis is applied to app store performance data, leading to the identification of properties of statistically significant releases, and of developer-controlled properties that could increase a release's chance of causal significance. This thesis introduces the Causal Impact Release Analysis tool, CIRA, for performing causal impact analysis on app store data, which makes the aforementioned research possible; combined with the earlier feature extraction technique, this enables the identification of the claimed software features that may have led to significant positive and negative changes after a release.

Impact Statement

The work in this thesis seeks primarily to answer the question of "what makes successful apps successful?" through empirical analysis of app store data. The contributions work toward this goal, first establishing a feature extraction technique using topic modelling, and identifying the sampling issues present in mining app store data (publication at MSR 2015). The work then turns to time series data, identifying the app releases that had the greatest subsequent effect on app success, and analysing common factors (publications at ICSE 2016 and FSE 2016). One of the outcomes of this work is the tool CIRA, an implementation of causal impact analysis, a time series analysis technique. The secondary goal of this thesis is to help define the growing field of "app store analysis for software engineering", which it does through a comprehensive review of the literature in the field (accepted for publication in the IEEE Transactions on Software Engineering).

Publications

The following papers were authored as part of this PhD.

First-Author Peer-Reviewed Papers

– W. Martin, M. Harman, Y. Jia, F. Sarro, and Y. Zhang, "The app sampling problem for app store mining," in Proceedings of the 12th IEEE Working Conference on Mining Software Repositories, MSR '15, 2015, pp. 123–133. Citations at the time of writing: 35

– W. Martin, "Causal impact for app store analysis," in Companion Proceedings of the 38th International Conference on Software Engineering, ICSE Companion '16. ACM, 2016. Citations at the time of writing: 3

– W. Martin, F. Sarro, and M. Harman, "Causal impact analysis for app releases in Google Play," in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, FSE '16, 2016, to appear. Citations at the time of writing: 5

– W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, "A survey of app store analysis for software engineering," IEEE Transactions on Software Engineering, 2016. Citations at the time of writing: 20

Technical Reports

– W. Martin, F. Sarro, and M. Harman, "Causal impact analysis applied to app releases in Google Play and Windows Phone Store," University College London, Tech. Rep. RN/15/07, 2015. Citations at the time of writing: 6

– W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, "A survey of app store analysis for software engineering," University College London, Tech. Rep. RN/16/02, 2016. Citations at the time of writing: 20

Co-Authored Peer-Reviewed Papers

In addition to the first-author papers listed above, I have also contributed to the following publications:

– F. Sarro, A. Al-Subaihin, M. Harman, Y. Jia, W. Martin, and Y. Zhang, "Feature lifecycles as they spread, migrate, remain and die in app stores," in Proceedings of the 23rd IEEE International Requirements Engineering Conference, RE '15. IEEE, 2015. Citations at the time of writing: 15

– A. Al-Subaihin, A. Finkelstein, M. Harman, Y. Jia, W. Martin, F. Sarro, and Y. Zhang, "App store mining and analysis," in Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile, DeMobile 2015. ACM, 2015, pp. 1–2. Citations at the time of writing: 3

– M. Harman, A. Al-Subaihin, Y. Jia, W. Martin, F. Sarro, and Y. Zhang, "Mobile app and app store analysis, testing and optimisation," in Proceedings of the International Conference on Mobile Software Engineering and Systems, MOBILESoft '16. ACM, 2016, pp. 243–244. Citations at the time of writing: 1

Technical Reports

– A. Finkelstein, M. Harman, Y. Jia, W. Martin, F. Sarro, and Y. Zhang, "App store analysis: Mining app stores for relationships between customer, business and technical characteristics," Tech. Rep. RN/14/10, 2014. Citations at the time of writing: 11

Acknowledgements

I’d like to send a big thanks to my supervisors Mark, Shin, Yue and Federica, for their guidance and support.

A big thank you to my dedicated proof-reading team, Lynn, Steve and Aja, who are now experts in App Store Analysis. Your attention to detail has never wavered.

And to Anne, your drive, motivation and passion for science inspire me.

To support this PhD, I’m very grateful to have been in receipt of funding from the DAASE project grant.

To everyone mentioned, without each of you none of this would have been possible. My heartfelt thanks.

Contents

Abstract 2

Impact Statement 3

Publications 4

Acknowledgements 6

1 Introduction 12

2 Literature Review 16
2.1 Introduction ...... 16
2.1.1 Contributions ...... 17
2.1.2 Definitions ...... 17
2.1.3 Overview ...... 19
2.2 Literature Search ...... 19
2.2.1 Scope ...... 19
2.2.2 Search Methodology ...... 20
2.2.3 Snowballing ...... 22
2.2.4 Search Results ...... 22
2.2.5 Lessons Learned ...... 24
2.3 Non-Technical Research ...... 25
2.4 Scale of Studies ...... 26
2.5 Key Ideas Timeline ...... 27
2.6 API Analysis ...... 27
2.6.1 API Usage ...... 30
2.6.2 Class Reuse and Inheritance ...... 31
2.6.3 Faults ...... 32
2.6.4 Permissions and Security ...... 33
2.7 Feature Analysis ...... 35
2.7.1 Automated Classification ...... 37

2.7.2 Clustering ...... 38
2.7.3 Lifecycles ...... 38
2.7.4 Recommendation ...... 39
2.7.5 Store Success ...... 40
2.7.6 Verification ...... 42
2.8 Release Engineering ...... 43
2.8.1 Content ...... 44
2.8.2 Success ...... 45
2.8.3 Strategy ...... 45
2.9 Review Analysis ...... 46
2.9.1 Classification ...... 48
2.9.2 Content ...... 49
2.9.3 Requirements Engineering ...... 50
2.9.4 Sentiment ...... 52
2.9.5 Summarisation ...... 53
2.9.6 Surveys and Methodological Aspects of App Store Analysis ...... 54
2.10 Security ...... 56
2.10.1 Faults ...... 56
2.10.2 Malware ...... 57
2.10.3 Permissions ...... 59
2.10.4 Plagiarism ...... 60
2.10.5 Privacy ...... 61
2.10.6 Vulnerability ...... 62
2.11 Store Ecosystem ...... 63
2.11.1 Inter-store ...... 63
2.11.2 Intra-store ...... 66
2.11.3 Recommendation ...... 67
2.11.4 Simulations ...... 67
2.12 Size and Effort Prediction ...... 68
2.13 Checklist for Future App Store Analysis Authors ...... 70
2.14 Threats to Validity ...... 71
2.15 Conclusions ...... 71

3 Methodology 73
3.1 Statistical Analysis ...... 73
3.2 Data ...... 74
3.3 Mining Process ...... 74
3.3.1 User Reviews ...... 77
3.3.2 Persistent List Collection ...... 77
3.3.3 Time Series Datasets ...... 78
3.4 Metrics ...... 78
3.5 Text Preprocessing ...... 78
3.6 Topic Modelling ...... 80
3.7 TF.IDF ...... 82

4 App Feature Extraction using Topic Models 83
4.1 Introduction ...... 83
4.2 Findings ...... 84
4.3 Definitions ...... 84
4.4 Data ...... 84
4.5 Research Questions ...... 85
4.6 Application of LDA ...... 85
4.7 Results ...... 86
4.8 The Problem of Zero Rated Apps ...... 90
4.9 Threats to Validity ...... 92
4.10 Conclusions ...... 93

5 The App Sampling Problem for App Store Mining 94
5.1 Introduction ...... 94
5.2 Contributions ...... 96
5.3 Definitions ...... 96
5.4 The App Sampling Problem ...... 97
5.5 Data ...... 99
5.6 Research Questions ...... 100
5.7 Topic Modelling ...... 104
5.8 Results ...... 104

5.9 Actionable Findings for Future App Store Analysis ...... 113
5.10 Threats to Validity ...... 114
5.11 Related Work ...... 115
5.12 Conclusions ...... 117

6 Causal Impact Analysis for App Stores 118
6.1 Introduction ...... 118
6.1.1 Process ...... 119
6.2 Causal Impact Release Analysis Tool (CIRA) ...... 122
6.2.1 Design ...... 122
6.2.2 Program Flow ...... 124
6.2.3 Agreement with CausalImpact ...... 126
6.2.4 Web Access ...... 127
6.3 Extension 1 – Ensemble Classifier ...... 127
6.4 Extension 2 – Feature Regression ...... 129
6.5 Threats to Validity ...... 130
6.6 Conclusions ...... 130

7 What are the Properties of Impactful Releases? 132
7.1 Introduction ...... 132
7.2 Developer Controlled Metrics ...... 133
7.3 What are the Characteristics of Impactful App Releases? ...... 134
7.3.1 Findings ...... 134
7.3.2 Data ...... 135
7.3.3 Research Questions ...... 137
7.3.4 Application of CausalImpact ...... 140
7.3.5 Results ...... 140
7.3.6 Threats to Validity ...... 152
7.3.7 Conclusions ...... 153
7.4 Causal Impact Analysis for App Releases in Google Play ...... 154
7.4.1 Findings ...... 154
7.4.2 Data ...... 155
7.4.3 Research Questions ...... 156

7.4.4 Application of CIRA ...... 160
7.4.5 Results ...... 160
7.4.6 Threats to Validity ...... 171
7.4.7 Conclusions ...... 171
7.5 What are the Features that are Added to Impactful Releases? ...... 173
7.5.1 Introduction ...... 173
7.5.2 Contributions ...... 174
7.5.3 Data ...... 174
7.5.4 Research Questions ...... 174
7.5.5 Methodology ...... 175
7.5.6 Results ...... 176
7.5.7 Threats to Validity ...... 178
7.5.8 Conclusions ...... 178
7.6 Related Work ...... 179

8 Conclusions and Future Work 181

A Closely Related Literature 213
A.1 User Surveys and Studies ...... 213
A.2 Related Security ...... 214
A.3 Reports ...... 214
A.4 Mining Tools ...... 215

B App Feature Extraction using Topic Models 216
B.1 Sliding Window Rating/Download Correlation Analysis Results ...... 216

C What are the Features that are Added to Impactful Releases? 223
C.1 Supplementary Category-Specific Topic Results ...... 223

D CIRA Usage Manual 235
D.1 Servlet Tool ...... 235
D.2 Tool Data Specifications ...... 235

E Developer Questionnaire 236

Chapter 1

Introduction

App stores are a recent phenomenon: the largest, Apple's App Store and Google Play, were launched in 2008, and were joined by Blackberry World App Store in 2009 and Windows Phone Store in 2010, in addition to numerous smaller stores. The largest stores have, since 2008, both accumulated in excess of 1 million downloadable and rateable apps. Complementing their large collections of hosted apps, mobile app stores are extremely lucrative: the set of online mobile app stores was projected to be worth a combined 25 billion USD in 2015 [181], a figure that is expected to grow in the coming years as smartphones become ever more pervasive [69], even in developing countries [257].

The vast success of app stores has been driven by mass consumer adoption. Google announced that there were 1.4 billion activated Android devices in September 2015 [39]. Smartphones existed prior to the launch of these stores, but it was not until 2008 that users could truly exploit their extra computing power and resulting versatility through downloadable apps. In-house and even commercial applications had been available before the launch of app stores, but app stores differed in several respects: availability, compatibility, ease of use, variety, and user-submitted content [188].

It is the user-submitted content that fundamentally distinguishes app stores from the ad-hoc commercially available applications that existed beforehand. As a result, software engineering researchers have access to large numbers of software applications together with customer feedback and commercial performance data, unavailable in previous software deployment mechanisms. Furthermore, through readily available downloadable toolkits, users can write their own applications to make use of a smart device's hardware. They can subsequently publish their software in the central app store for users to download (and possibly pay for). This publication process is subject to the store's in-house review and certification policies but, in general, apps and app updates can be made available quickly (typically within hours or days).

Following the inception of the major app stores Google Play and the Apple App Store, and their mass consumer adoption, a collection of innovative and insightful research papers has emerged, centred on the study of collections of apps mined from app stores. This field of research is called "App Store Analysis", and it is the primary focus of this thesis. A comprehensive review of the literature is undertaken in Chapter 2, identifying the key research papers that have advanced the field, and tracking its growth up until November 2015.

In Chapter 3, the methodology used throughout this thesis for data collection and analysis is introduced. We describe the statistical analysis techniques for dataset comparison and correlation analysis, and how they are used. We give an overview of the data collected to facilitate the studies in this thesis, as well as how it was collected, and any metrics extracted via information retrieval. We also give an overview of an unsupervised machine learning approach called topic modelling, and detail the text pre-processing techniques that are performed prior to its application.
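
To make the pre-processing step concrete, the following minimal Python sketch shows the kind of pipeline described above (lower-casing, tokenisation, stop-word removal and crude stemming). The stop-word list and suffix rules here are illustrative placeholders only; the actual steps and word lists used in Chapter 3 may differ.

import re

# Illustrative stop-word list; the thesis's actual list may differ.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "for", "with", "your"}

def simple_stem(token):
    # Crude suffix stripping, standing in for a proper stemmer (e.g. Porter).
    for suffix in ("ing", "ers", "er", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(description):
    # Lower-case, keep alphabetic tokens, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", description.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Share photos and videos with your friends, with built-in photo editing."))
# e.g. ['share', 'photo', 'video', 'friend', 'built', 'photo', 'edit']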

The driving question throughout this PhD was "what makes successful apps successful?", which is tackled from multiple perspectives throughout this thesis. The uniform hosting of apps by app stores makes them a good source for data extraction. We can extract technical information, such as the size and supported devices for each app, and non-technical information, such as descriptions. App descriptions can provide a large amount of information about apps that we would otherwise not be aware of. We can apply algorithms in order to extract the technical features that developers claim are present in an app, in addition to other information such as new or added content. In Chapter 4, we show that topic modelling can be used to extract technical features from app descriptions in a way that is analogous to n-gram based feature extraction. Performing correlation analysis on the mean metrics from each topic yields almost identical results to the same approach applied to features. An in-depth look at the apps with 0 and 1 ratings reveals that they are special cases and must be treated carefully when conducting empirical studies.
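
As a hedged illustration of this idea (detailed in Chapter 4), the sketch below fits an LDA topic model to a few invented app descriptions using scikit-learn and correlates per-app topic membership with ratings. The corpus, ratings, topic count and correlation target are placeholders, not the thesis's experimental setup.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.stats import spearmanr
import numpy as np

# Toy corpus of app descriptions (placeholders, not thesis data).
descriptions = [
    "edit photos and share pictures with friends",
    "photo editor with filters and collage maker",
    "track your runs, calories and fitness goals",
    "workout plans and fitness tracking for runners",
]
ratings = np.array([4.1, 3.8, 4.5, 4.3])  # invented average ratings

# Bag-of-words over unigrams and bigrams, roughly analogous to n-gram features.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(descriptions)

# Fit a small LDA model; each topic is treated as a grouping of claimed features.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-app topic membership probabilities

# Show the top terms for each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")

# Correlate membership of topic 0 with rating (Chapter 4 correlates per-topic
# mean metrics; this is the same idea at toy scale).
rho, p = spearmanr(doc_topics[:, 0], ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")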

Mining data from app stores can also present issues: app store owners may choose what data to make available, and in which order to present the apps. These choices can lead to sampling issues, such as the "App Sampling Problem", where app samples typically reflect only the most popular apps in a store, and only a cross-section of the submitted user feedback. This is discussed in more detail in Chapter 5, where we define dataset types of varying completeness in order to investigate the issue, and conduct experiments to gauge the potential effect of the problem on results. It is important that researchers become aware of the issue, and seek ways to ameliorate the problem.

Understanding what is present in apps at any given time is useful, but perhaps even more critical to understanding how apps become successful is to look at how they change over time. Apps that are hosted in app stores often adopt frequent release strategies for various reasons, such as stimulating user downloads and reviews, fixing bugs or adding new content. By mining app data over a period of time, it is possible to identify the points in time (the releases) at which significant increases or decreases in performance are realised. This is done using a technique called causal impact analysis, which trains a statistical model on the time series data of both the app in question and a non-releasing control set, in order to detect where a significant change has occurred that may be a result of the new release.
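
The sketch below illustrates the underlying idea in simplified form; it is not the CIRA implementation described in Chapter 6, and the series, release point and significance rule are invented for illustration. A linear model is fitted on the pre-release window relating the app's metric to a non-releasing control series, a counterfactual is predicted for the post-release window, and the observed deviation is tested against the pre-release noise level.

import numpy as np

def causal_impact_sketch(app_series, control_series, release_idx, z=1.96):
    """Flag a release as significant if post-release observations deviate
    from a counterfactual predicted from a non-releasing control series."""
    pre_x = control_series[:release_idx]
    pre_y = app_series[:release_idx]

    # Fit y ~ a + b * control on the pre-release window (ordinary least squares).
    A = np.column_stack([np.ones_like(pre_x), pre_x])
    (a, b), *_ = np.linalg.lstsq(A, pre_y, rcond=None)
    residual_sd = np.std(pre_y - (a + b * pre_x))

    # Counterfactual prediction for the post-release window.
    post_x = control_series[release_idx:]
    post_y = app_series[release_idx:]
    predicted = a + b * post_x

    # Mean deviation of observed from predicted, in units of pre-release noise.
    effect = np.mean(post_y - predicted)
    significant = abs(effect) > z * residual_sd / np.sqrt(len(post_y))
    return effect, significant

# Invented weekly rating series for an app and a non-releasing control.
control = np.array([4.0, 4.0, 4.1, 4.0, 4.1, 4.1, 4.0, 4.1, 4.1, 4.2])
app     = np.array([3.9, 3.9, 4.0, 3.9, 4.0, 4.0, 4.3, 4.4, 4.5, 4.5])
print(causal_impact_sketch(app, control, release_idx=6))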

In Chapter 6, we use this technique to identify the releases that have had the most impact on their respective apps' subsequent user rating success, and identify common factors among them. One such finding is that, amongst the releases of non-free apps, it is the new releases of expensive apps that are most likely to cause a positive change. We introduce CIRA, a tool for applying causal impact analysis to app store datasets, and elicit feedback from the developers of apps with positive and negative impactful releases, who agree with the observed impacts and express interest in learning from the results of these studies. Furthermore, we conduct an examination and comparison of the features present in impactful and non-impactful releases, and in positive and negative releases, in order to reveal the features that are added, or removed, in apps that achieve better rating performance after their releases.

This thesis is structured as follows: Chapter 2 performs a comprehensive literature survey on the field of App Store Analysis for software engineering; Chapter 3 discusses the analysis techniques used throughout this thesis; Chapter 4 studies the use of topic modelling as a method analogous to feature extraction, for the analysis of textual app data; Chapter 5 presents the app sampling problem, as well as a review of literature related to app reviews, datasets used, effects of the app sampling problem and ways to ameliorate them; Chapter 6 introduces causal impact analysis, and the tool CIRA for performing this analysis on app store data; Chapter 7 presents studies using this analysis to identify the properties and features of app releases that correlate with the greatest subsequent impacts, both positive and negative. Chapter 8 concludes this thesis, and additional supplementary content can be found in Appendices B to D.

Chapter 2

Literature Review

This chapter is to be published in TSE [186] (a pre-print is available at the time of writing). The first author's contribution to this paper was to formulate the idea, implement and execute the experimentation, and collect and analyse the results; the other authors of the paper contributed to research question formulation, result analysis and the narrative write-up.

App Store Analysis studies information about applications mined from app stores. App stores provide a wealth of information derived from users that would not exist had the applications been distributed via previous software deployment methods. App Store Analysis incorporates this non-technical information with technical information to learn trends and behaviours within these forms of software repositories. Findings from App Store Analysis have a direct and actionable impact on the software teams that develop for app stores, and have led to techniques for requirements engineering, release planning, software design, security and testing. This survey highlights and compares the areas of research that have been explored thus far, drawing out new directions future research could take to address open problems and challenges.

2.1 Introduction

This chapter provides a survey of literature that performs "App Store Analysis for Software Engineering" between 2000 and November 27, 2015. Although we do not expect app store analysis literature to date back prior to the launch of the major app stores in 2008, 2000 was chosen as the starting point to ensure that no relevant literature had been missed, for example literature focused on earlier platforms such as Nokia Widsets [222]. November 27, 2015 was chosen as the end point because the search queries were performed starting on November 28, 2015.

2.1.1 Contributions

The contributions of this study are as follows:
i) Formal definitions of apps, stores, and technical and non-technical attributes, which are used for App Store Analysis research.
ii) Growth patterns of App Store Analysis literature are studied both overall, and in each emergent subcategory.
iii) Analysis of the scale of app samples used, and discussion on how this is likely to progress in the future.
iv) Identification of some of the key ideas published in App Store Analysis, in addition to common aspects, trends and future directions, to help readers to understand the progression of the field overall.

2.1.2 Definitions

The following definitions help to clarify key components of App Store Analysis literature. They are used to find all of the relevant literature.

App: An item of software that anyone with a suitable platform can install without the need for technical expertise.

App Store: A collection of apps that provides, for each app, at least one non-technical attribute.

Technical attribute: An attribute that can be obtained solely from the software.

Non-technical attribute: An attribute that cannot be obtained solely from the software.

Examples of attributes are shown in Figure 2.1, based on the data that was collected to facilitate the studies in Chapters 5 and 6. As the diagram shows, some attributes are distinctly technical or non-technical in a boolean sense, but others lie in a grey area, depending on the precise interpretation of what can be obtained from software alone. Those in the grey area cannot be considered technical in the strictest sense of the definition, because they are not guaranteed to be obtainable solely from the software in all cases. These attributes can be both non-technical and technical, depending on how they are obtained. They are attributes that, in some cases, can be provided by the developer and not the app store, whilst attributes that are strictly non-technical may only be provided by an app store. For example, consider the 'author' attribute. In the case of Android software, the author can be obtained solely from the distributed APK file. However, in the case of a compiled C binary such as a simple "hello world" program, the author cannot be obtained directly from the binary file. The 'author' attribute therefore belongs in the grey area. The size of a C binary can be obtained, and so this attribute is technical; one cannot obtain the price from either of these example files, and so this is a non-technical attribute.

Figure 2.1: Example attributes [186]. Left: attributes that are strictly technical; right: attributes that are strictly non-technical; centre grey area: attributes that may be in either category.

This definition of App Store may seem simplistic. However, at the time of writing, app stores serve as more than just collections of apps; they enable more developers than ever to produce and distribute content, and provide a communication channel between users and developers via reviewing systems. Therefore, the definition is aimed at inclusivity. In only 7 years since the launch of the two biggest app stores, there are already over 180 papers devoted to their study, and each of these stores hosts well over 1 million apps. As this rapid development shows, the concept of apps and app stores is very likely to evolve over the coming years. It is our aim to encompass this evolution as best as possible through the stated definitions, in the hope that future surveys will be able to build upon this work and the App Store Analysis literature to come.

2.1.3 Overview

This survey is structured as follows: Section 2.2 describes the process used to find the included literature; Section 2.3 breaks down the growth trends in non-technical research compared with technical-only research, and Section 2.4 breaks down the growth in the scale of apps used; key ideas in each subfield of app store analysis are identified in Section 2.5. The following App Store Analysis subfields are defined, based on the literature gathered through the process explained in Section 2.2: "API Analysis", which is discussed in Section 2.6; "Feature Analysis", which is discussed in Section 2.7; "Release Engineering", which is discussed in Section 2.8; "Review Analysis", which is discussed in Section 2.9; "Security", which is discussed in Section 2.10; "Store Ecosystem", which is discussed in Section 2.11; and "Size and Effort Prediction", which is discussed in Section 2.12. An inclusion checklist for future app store analysis authors is outlined in Section 2.13. Closely related work that does not meet the inclusion criteria, but is highly relevant, is discussed in Appendix A.

2.2 Literature Search

In this section, the process used to find literature is described, including scope, search terms and repositories, and lessons learned for future app store analysis surveyors.

2.2.1 Scope

App Store Analysis literature encompasses studies that perform analysis on a collection of apps mined from an app store. We are particularly interested in studies that combine technical with non-technical attributes, as these studies pioneer the new research opportunities presented by app stores. However, we also include studies whose authors use app stores as software repositories, to validate their tools on a set of real-world apps, or that exploit specific properties such as the malware verification process apps go through before being published in the major app stores. This survey is not a Systematic Literature Review (SLR). The area of App Store Analysis is still developing, and at the time of writing does not have a well-defined body of literature, with the exception of the subset of app review analysis, which Khalid et al. have helped to identify through their survey [141].

Because there is not a well-defined body of literature for app store analysis as a whole, and there are no prior surveys, research questions cannot yet be systematically chosen and asked. This study therefore aims to define, collect and curate the disparate literature, arguing and demonstrating that there does, indeed, exist a coherent area of research in the field that can be termed "App Store Analysis for Software Engineering". We hope that this will prove to be an enabling study for future SLRs in this area.

The following inclusion criteria are applied:
i) The paper is related to software engineering, and may have actionable consequences for software users, developers or maintainers.
ii) The paper is related to mobile app stores, concerning the use of collections of apps or non-technical data gathered from one or more app stores.

The following exclusion criteria are applied:
i) The paper focuses on mobile app development but does not extend to collections of apps nor to app stores.
ii) The paper uses an arbitrary collection of apps to test a tool, but the collection was not mined from an app store, and the study does not extend to app stores (and could not clearly be extended to apply to app stores).

2.2.2 Search Methodology

In order to collect all relevant literature to date that meets the scope defined in Section 2.2.1, a systematic search was performed for the terms defined below, from each repository (also defined below). Unique papers are collected into a table, and a decision is made based on the inclusion and exclusion criteria in three stages:

Title: Publications are removed that are clearly irrelevant from their titles.

Abstract: The abstract is inspected and publications are removed that are clearly irrelevant according to the scope defined in Section 2.2.1.

Body: Results are read fully and a judgement is made on whether the paper a) meets the key requirements on what is defined as "app store analysis" in our scope, or b) is very relevant to the field and so should be included as "expanded literature", to put the main literature into context. Papers matching the requirements of a) or b) are included in this survey.

A summary of the number of papers found through the search, as well as the number of papers accepted at each stage of validation, can be found in Table 2.1. All of the references for papers discussed in this survey are available in an online repository [246].

Search Repositories

A search was performed in each of the following repositories for papers to include in the study: Google Scholar, Scopus, JSTOR, ACM, IEEE and arXiv.

Terms

Searches were performed for the following terms and phrases, to encompass the sub-fields of App Store Analysis that emerge from the literature: "App Store", mining, API, feature, release, requirements, reviews, security, and ecosystem. Searches were performed for the following specific queries, where terms joined by an 'AND' must appear, and phrases in quotes must appear verbatim:
"app store analysis"
"app store analysis" AND mining
"app store analysis" AND mining AND API
"app store analysis" AND mining AND feature

Searches were then performed for the following more general queries, to ensure that no relevant literature was missed from the survey:
"app store" AND analysis AND API
"app store" AND analysis AND API AND mine
"app store" AND analysis AND feature AND mine
"app store analysis" AND mining AND requirements
"app store analysis" AND mining AND release
"app store analysis" AND mining AND reviews
"app store analysis" AND mining AND security
"app store analysis" AND mining AND ecosystem

The threat of missing papers is mitigated by conducting searches for "app store analysis" AND "mining", and also for each of the names of the major subfields of App Store Analysis literature. Since, by the definition in Section 2.1.2, app store analysis research uses collections of apps, this should encompass much of the field. Snowballing was also performed, which further helps to mitigate the threat of potentially missing papers. However, the potential threat of missing papers is a threat to the validity of any survey, including this one.

Table 2.1: Specific search query results [186]. Query columns: Q1 = "app store analysis"; Q2 = "app store analysis" AND mining; Q3 = "app store analysis" AND mining AND API; Q4 = "app store analysis" AND mining AND feature.

Google Scholar                        IEEE
          Q1     Q2    Q3    Q4       Q1    Q2    Q3    Q4
Hits      35     17     9    13        3    40    13    13
Inspect   35     17     9    13        3    40    13    13
Title     15     13     8    12        3     8     8     8
Abstract  13     13     8    12        3     7     4     4
Body      12     13     8    12        3     5     4     4

ACM                                   JSTOR
          Q1     Q2    Q3    Q4       Q1    Q2    Q3    Q4
Hits       7  1,146   295   231        0    36     4    13
Inspect    7  1,146   295   231        0    36     4    13
Title      4     69    44    31        0     0     0     0
Abstract   3     57    27    22        0     0     0     0
Body       3     44    26    17        0     0     0     0

arXiv                                 Scopus
          Q1     Q2    Q3    Q4       Q1    Q2    Q3    Q4
Hits       0     81    28    10        1   128    21     1
Inspect    0     81    28    10        1   128    21     1
Title      0      4     1     0        1   128    21     1
Abstract   0      4     1     0        0    13     6     0
Body       0      4     1     0        0    11     4     0

2.2.3 Snowballing

In addition to the repository searches specified in Section 2.2.2, snowballing [296] is also performed on the included studies. To do this, the studies cited by each included study are inspected, as well as the publications that subsequently cited the included study, using Google Scholar and ACM. By performing this process in addition to repository keyword searching, we reduce the risk that relevant literature is omitted from this survey.
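
A minimal Python sketch of this iterative snowballing procedure is shown below, over a hypothetical citation map; the paper identifiers and links are invented, and in practice the backward and forward citation data come from Google Scholar and ACM, with each candidate passing through the title/abstract/body checks of Section 2.2.2.

from collections import deque

# Hypothetical citation data: paper -> papers it cites (backward) and papers citing it (forward).
cites = {"P1": ["P2", "P3"], "P2": ["P4"], "P3": [], "P4": [], "P5": ["P1"]}
cited_by = {"P1": ["P5"], "P2": ["P1"], "P3": ["P1"], "P4": ["P2"], "P5": []}

def snowball(seed_papers, meets_inclusion_criteria):
    """Repeatedly inspect the references and citing papers of every included study."""
    included, queue = set(seed_papers), deque(seed_papers)
    while queue:
        paper = queue.popleft()
        for candidate in cites.get(paper, []) + cited_by.get(paper, []):
            if candidate not in included and meets_inclusion_criteria(candidate):
                included.add(candidate)
                queue.append(candidate)
    return included

# Toy run: accept every candidate; in practice the three-stage checks apply.
print(snowball({"P1"}, lambda p: True))  # {'P1', 'P2', 'P3', 'P4', 'P5'}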

2.2.4 Search Results

Search results can be found in Tables 2.1 and 2.2. The tables indicate the number of hits each query generated, the number of these that were available to be inspected, and the number of titles and subsequent abstracts and paper bodies that were accepted as valid. Table 2.1 presents the more specific queries run in multiple paper repositories, and Table 2.2 presents the more general queries run only in Google Scholar.

Table 2.2: General search query results [186] (Google Scholar only). Query columns: Q1 = "app store" AND analysis AND API; Q2 = "app store" AND analysis AND API AND mine; Q3 = "app store" AND analysis AND feature AND mine; Q4 = "app store analysis" AND mining AND requirements; Q5 = "app store analysis" AND mining AND release; Q6 = "app store analysis" AND mining AND reviews; Q7 = "app store analysis" AND mining AND security; Q8 = "app store analysis" AND mining AND ecosystem.

             Q1      Q2      Q3    Q4   Q5   Q6   Q7   Q8
Hits      3,130     409   1,040    12    9   15    9    9
Inspect   1,000     409   1,000    12    9   15    9    9
Title        87      35      37    12    9   14    8    9
Abstract     61      23      33    12    9   14    8    9
Body         52      21      32    12    9   14    8    9

In the case of Google Scholar, only the top 1,000 results were accessible to inspect at the time of search, due to the repository's limitations.

An overlap was found between the search queries performed, and thus the total number of papers discovered through search queries is lower than suggested by the sum of the bottom rows in Table 2.1. Many papers were discovered through snowballing, and these do not appear in the tables.

A summary of the included literature is given in Tables 2.4 to 2.10. Histograms depicting the growth of publications on App Store Analysis for software engineering can be found in Figures 2.2 to 2.4, which show the split between technical-only and combined technical and non-technical research, the split between the different subfields identified as subsections in this survey, and the split between scales of study in terms of the number of apps used, respectively. Numbers in the histogram bars indicate the number of studies in each subgrouping, and bars without numbers represent one study. A breakdown of the studies in each sub-field that emerged from the literature is also presented in Figure 2.5.

The earliest reported study is from 2010. This is likely because the app stores that propelled mobile apps to widespread adoption were launched in 2008. Yet it is interesting that studies incorporating technical with non-technical app store information did not emerge until two years later. Papers were collected until November 27, 2015.

Figure 2.2: Histogram of non-technical and technical-only research papers, showing the period from 2010 to November 27, 2015 [186]

2.2.5 Lessons Learned

As can be seen from Table 2.1, for some queries there were large drops in the number of papers upon inspection of their title or abstract, when performing the more general searches on Google Scholar: searches for "app store" combined with many of the other search terms resulted in several thousand papers, which may have mentioned "app store" only once. We found that searching for "app store analysis" as a phrase narrowed the results down considerably, but did miss some relevant papers.

Searches that included "mining" as a keyword did encompass much of app store analysis research, due to the focus on collections of apps that meets our app store definition. However, we found that the snowballing technique was crucial in the literature search, because paper discovery through many of the paper repositories used could not be relied upon to find all relevant papers; in a growing field, terms of reference are not fully stabilized. Future surveyors are encouraged to visit the App Store Analysis paper repository [246], which can assist in the discovery of app store analysis literature.

Figure 2.3: Histogram of sub-field trends (API, Feature, Release, Reviews, Security, Store Ecosystem, Prediction), showing the period from 2010 to November 27, 2015 [186]

Figure 2.4: Histogram showing the number of research papers grouped into app quantity ranges, for the period from 2010 to November 27, 2015 [186]

2.3 Non-Technical Research

While software engineering deals primarily with code, it is not confined to strictly technical sources of information. One can combine data from multiple (technical and non-technical) sources, and app stores provide a wealth of such information. There are 124 of 184 (68%) papers included in this study that incorporate non-technical information mined from app stores in order to either infer technical attributes (such as features), or to extract useful information such as bug reports and feature requests from users. The histogram in Figure 2.2 shows that the number of studies incorporating non-technical information grew year-on-year, with the notable exception of 2014. It is unknown what caused the apparent drop in literature in 2014, but with the exception of this outlier there was linear growth in App Store Analysis research between 2010 and 2015.

Figure 2.5: Pie chart showing the overall sub-field distribution for the period from 2010 to November 27, 2015 [186]

Table 2.3: Number of research papers studying each app quantity range, from 2010 to November 27, 2015 [186]

No. Apps Range    Papers
0                      6
[1, 10)               19
[10, 10^2)            21
[10^2, 10^3)          31
[10^3, 10^4)          36
[10^4, 10^5)          38
[10^5, 10^6)          28
≥ 10^6                 3

2.4 Scale of Studies

In order to discuss the number of apps that are studied by research papers, we first need to define a set of ranges. Papers are assigned to app quantity ranges in ascending powers of 10, according to the number of apps that they consider. The ranges that we assign, and the number of research papers that study them, are shown in Table 2.3. The median number of apps used in the considered literature is 1,679, and the mean is 44,807. This result shows that half of the papers study fewer than 2,000 apps, but the other half study a quantity of apps several orders of magnitude larger. This is reflected in Figure 2.4, where the range [10^4, 10^5) is shown to grow and in 2015 represents almost half of the app usage literature. Figure 2.4 shows growth in the studies using between 10^4 and 10^5 apps from 2011 to 2015, and this result is reflected in studies using between 10^5 and 10^6 apps as well, up to 2014. It is important to note that we did not have complete data for 2015, so this result is subject to change. Studies using smaller scales of apps show an uncertain change in frequency, indicating that most studies in the future are likely to continue using over 10^4 apps. We anticipate larger studies in the future, based on the growth of App Store Analysis literature, the increasing quantity of apps studied, and of course the growing app stores themselves.

2.5 Key Ideas Timeline

A timeline depicting the key ideas is shown in Figure 2.6. This highlights the launch of the major app stores studied, as well as the first studies in each subsection. The timeline includes studies that have advanced the field of App Store Analysis in some way (based on the qualitative judgement of the author of this thesis), or that introduced influential ideas into their respective subsections and are therefore well cited.

2.6 API Analysis

Papers that extract the API usage from app APKs or source code, and combine this information with non-technical data, are discussed in this section, and are summarised in Table 2.4. All API analysis literature studied apps from the Android platform only. This may be due to the availability of tools which can be used to decompile the apps and extract their API calls, which are freely available and can be applied to downloaded app binaries. It is perhaps surprising that such analyses have not also been performed on the Apple platform, iOS, since the store was launched in 2008.

2008
• Apple App Store launched
• Google Play launched (as Android Market)

2009
• Blackberry World App Store launched

2010
• [253] Shabtai et al. extracted feature information to differentiate between Tools and Games categories (Feature)
• [31] Bläsing et al. used the Android Market as a software repository for testing their APK analyser (Security)
• Windows Phone Store launched

2011
• [260] Syer et al. compared Blackberry and Android platforms using feature-equivalent apps (Store ecosystem)
• [151] Lee et al. analysed deployment strategies for optimising download rank (Release engineering)
• [113] Henze and Boll studied release times in the Apple App Store for game deployment (Release engineering)
• [252] Sethumadhavan discussed function point analysis for mobile apps (Size and Effort Prediction)

2012
• [96] Goul et al. analysed reviews in order to facilitate requirements engineering (Reviews)
• [47] Chandy and Gu mined 6,319,661 reviews from the Apple App Store for spam classification (Reviews)
• [109] Harman et al. connected non-technical, technical and business aspects; extracted technical features from text descriptions (Feature)
• [240] Ruiz et al. studied class reuse and inheritance in Google Play (API)
• [311] Zhu et al. studied the problem of mobile app classification (Feature)

2013
• [280] Vision Mobile found 72% of developers dedicated to Android; iOS and Android developers earn twice that of developers using other platforms
• [158] Lim and Bentley simulated an app store ecosystem (Store ecosystem)
• [124] Iacob and Harrison automatically analysed app reviews to identify feature requests and bug reports (Reviews)
• [313] Zhu et al. studied ranking fraud in app stores (Security)

2014
• [238] Ruiz et al. investigated the effect of ad libraries on app ratings (API)
• [95] Gorla et al. performed malware detection through API usage/description cluster outliers (Feature)
• [281] Vision Mobile found paid and with-ads models almost tied in revenue and developer share

2015
• [191] McIlroy et al. studied update frequency of Google Play apps (Release engineering)
• [177] Malavolta et al. investigated hybrid mobile apps from technical and user perspectives (Reviews)
• [219] Park et al. studied the mobile app retrieval problem (Reviews)
• [247] Sarro et al. studied feature migration between apps (Feature)
• [141] Khalid et al. surveyed app store review literature (Reviews)
• [78] Ferruci et al. compared approaches for size and effort prediction in mobile apps (Size and Effort Prediction)

Figure 2.6: Key ideas timeline for App Store Analysis literature. The primary area of study is suffixed in parentheses [186]

Table 2.4: Chronological summary of API-related literature [186]

Authors [Ref], Year | Venue | No. apps
Ruiz et al. [240], 2012 | ICPC | 4,323
Linares-Vásquez et al. [164], 2013 | FSE | 7,097
Shirazi et al. [241], 2013 | EICS | 400
Minelli and Lanza [195], 2013 | ICSM | 20
Minelli and Lanza [196], 2013 | CSMR | 20
Ruiz et al. [238], 2014 | IEEE Soft. | 236,245
Hao et al. [107], 2014 | MobiSys | 3,600
Dering and McDaniel [71], 2014 | MILCOM | 450,000
Linares-Vásquez et al. [167], 2014 | MSR | 24,379
Ruiz et al. [237], 2014 | IEEE Soft. | 208,601
Linares-Vásquez [163], 2014 | ICSE comp. | 0
Viennot et al. [276], 2014 | SIGMETRICS | 1,107,476
Bartel et al. [19], 2014 | IEEE Soft. Eng. | 1,421
Zhang et al. [303], 2014 | WiSec | 10,311
Borges and Valente [35], 2015 | PeerJ C. S. | 396
Bavota et al. [22], 2015 | IEEE Soft. Eng. | 5,848
Kim et al. [145], 2015 | ASE | 350
Khalid et al. [137], 2015 | IEEE Soft. | 10,000
Watanabe et al. [293], 2015 | SOUPS | 200,000
Zhou et al. [309], 2015 | WiSec | 36,561
Wan et al. [287], 2015 | ICST | 398
Wang et al. [288], 2015 | ISSTA | 105,299
Syer et al. [261], 2015 | Soft. Qual. | 5
Azad [16], 2015 | Masters Thesis | 950
Wang et al. [289], 2015 | UbiComp | 7,923
Seneviratne et al. [250], 2015 | WiSec | 4,114
Mean | | 93,298
Median | | 5,086

This might be because iOS binaries are only available for the intended platforms, and cannot be downloaded to, or used from, a desktop computer without an Apple Developer account, which is not free. Even with such an account, app binaries or source code would be needed, and neither is freely available due to a) copyright on binaries and b) many iOS apps being paid-for apps. Due to these difficulties, it is uncertain whether it will be possible for future studies to extract API information from iOS apps; in fact, it may become harder since the move (in iOS version 9) to developer-submitted LLVM [172] intermediate representation binaries, which are then compiled for specific platforms by Apple. The Windows Phone platform is relatively recent, and we may start to see API analysis studies utilising this platform; the Google Play store launched in 2008 (as Android Market), but it was not until 2012 that App Store Analysis literature studied API usage in the store.

All API analysis literature has, hitherto, studied apps from the Android platform. There is a large range in the number of apps considered, from 0 apps to over 1 million.

API analysis literature can be decomposed into "API Usage", "Class Reuse and Inheritance", "Faults" and "Permissions and Security". There is some overlap between the latter subsection and Section 2.10 on Security. Nevertheless, the literature discussed in this section is collected together and discussed here because it directly analyses API usage. Two papers included in this section relate to energy usage [165, 287], although much of this field of research relates only indirectly to app stores. For those who wish to learn more on the subject, we point the reader to the recent paper by Hindle [114].

2.6.1 API Usage

This subsection concerns studies into the usage of APIs, which can be extracted from Android APK files.

Borges and Valente [35] used association rule mining to infer API usage patterns, using a dataset of 396 open source Android apps. For their study, the authors extended APIMiner [199] to mine usage patterns and instrument API documents with extracted usage patterns. They reported a study over 17 months, during which instrumented Android documentation was made publicly available and received approximately 78,000 visits. Shirazi et al. [241] extracted the API usage with regard to user interface (UI) elements and layout, and compared statistics between the 21 different categories of the Google Play store that existed in 2012.

Wan et al. [287] explored energy hotspots in apps by transforming their UIs and producing a ranked list of UI components by energy consumption. The authors tested their approach on 398 apps mined from Google Play. Azad [16] studied apps mined from Google Play and F-Droid, and produced tools to inspect API usage and suggest similar APIs based on Stack Overflow discussions; score the similarity of apps; identify the degree to which apps have copied the source code of open source projects; and detect license violations. Tian et al. [269] extracted API information and evaluated apps in terms of code complexity, API dependency, API quality, as well as a number of other factors, in order to learn which features distinguish high from low rated apps.

2.6.2 Class Reuse and Inheritance

This subsection concerns studies into the reuse of code and classes, and the use of inheritance, in Android apps.

In 2012 Ruiz et al. [240] studied class reuse and inheritance in 4,323 Android apps mined from 5 categories in Google Play. Of these, 217 apps were found to contain exactly the same set of classes as another app in the same category. The study was later extended to 208,601 apps by Ruiz et al. [237] in 2014. More evidence of substantial code reuse was found, and the authors concluded that app developers benefit from increased productivity but risk dependence on the quality of the code they reuse.

In 2013 Minelli and Lanza presented a visual analytics web tool for studying repositories of apps [195, 196]. The tool analyses snapshots of apps throughout their version history, using an interactive graphical user interface. Following their subsequent study on 20 free and open source Android apps, the authors found that 3rd party API code is often (incorrectly) committed along with the app code, instead of including the corresponding 3rd party jar files. Excluding 3rd party code, most apps were found to have comparatively small code-bases. Additionally, the authors found little use of inheritance in Android apps, and much duplication. Viennot et al. [276] introduced the PlayDrone Google Play crawler, which they used to store daily data on 1.1M apps and decompile 880k free apps. The authors found that native libraries are heavily used in popular apps, and that approximately a quarter of free apps are duplicates of other apps. They found that paid apps account for just 0.05% of downloads, and the top 10% of most popular apps account for 96% of total downloads as of June 23, 2013.

Linares-Vásquez et al. [167] decompiled and analysed 24,379 APKs from Google Play, and found that 82% of detected clones replicate 3rd party libraries. Zhang et al. [303] proposed ViewDroid, an app plagiarism detection system that uses view transition graphs as "birthmarks" to capture app behaviour, in order to detect clones in the presence of code obfuscation. Apps mined from Google Play were used as a false negative set. In a related study, Kim et al. [145] scan API invocations to identify plagiarised applications, in a more sophisticated approach than similarity detectors that scan code, as it handles code obfuscation. Wang et al. [288] proposed WuKong, a two-phase Android clone detection system that first filters third-party libraries to increase detection speed. The authors tested the system on 105,299 Android apps and found zero false positives.

2.6.3 Faults

This subsection concerns studies into faults that are found to be present in Android apps, typically through automated procedures.

Linares-Vásquez et al. analysed the effect of fault and change-prone core Google APIs on app ratings [164]. This is an important study as it combines technical API information with non-technical information in the form of average user reviews, in order to assess the impact that API usage can have. Fault and change prone APIs were found to be used more frequently by poorly-rated apps. Conversely, popular apps used APIs that were found to be less susceptible to faults and changes. The paper presents an analysis of 7,097 randomly selected free apps with > 10 reviews. Changes and faults were measured as the number of API changes and bug fixes, respectively, to the particular associated core libraries.

Building on the work by Linares-Vásquez et al. [164], Linares-Vásquez also presented an approach for a recommendation system for Android app developers [163], to help them to prepare for platform updates and avoid breaking changes and introducing bugs. The authors extended their API analysis work to identify APIs that have a high energy usage [165], but this study did not combine non-technical app store information.

Bavota et al. [22] investigated how the number of changes and faults present in APIs used affected apps’ ratings. Their results showed an inverse correlation between the popularity of apps and the number of faults and changes in APIs they used. That is, low rated apps were found to use APIs that are more fault- and change-prone than highly rated apps. Bavota et al. surveyed 45 Android developers who confirmed this relationship from anecdotal experience. These studies combined technical (API usage) with non-technical (user ratings) information to highlight best practice for API usage in Android development.

Syer et al. [261] studied the effect of platform independence on source code quality, finding that the more defect prone source files also depend more heavily on the platform. The authors therefore suggest prioritising platform-dependent source files for unit testing, as a quality assurance strategy. In 2015, Khalid et al. [137] performed static analysis on 10,000 free Google Play apps, and found that 3 categories of FindBugs warnings occur more frequently in lower rated apps. The categories 'bad practice', 'internationalisation' and 'performance' had more warnings in lower-rated apps, suggesting that these areas are the ones developers should focus on to achieve better rating performance.

2.6.4 Permissions and Security

This subsection concerns studies into the permissions usage, and related security concerns in Android apps.

In 2013 Peiravian and Xingquan [220] used API calls and permissions data to train their malware classifier, which they trained and validated on 1,260 malware samples and 1,250 benign samples, using cross-validation. Hao et al. [107] studied the insertion of UI handlers into app code. They published the PUMA tool, which makes UI automation programmable and enables researchers to analyse correctness properties of apps. They tested the tool on a set of 3,600 apps downloaded from Google Play. Dering and McDaniel [71] downloaded a set of 700,000 app binaries from 450,000 free apps on Google Play and analysed library and permission usage. They found a strong correlation between the number of libraries used and the number of permissions requested by the apps, leading to the conclusion that libraries tend to have specific use cases that require additional permissions from the user. This finding presents a security concern: is each library doing what it is supposed to, and does it need this permission? In conjunction with the finding by Book et al. [34], this suggests that library usage is a significant security concern, since libraries often make use of existing permission privileges, and also increase the number of permissions requested.

Ruiz et al. studied the effect of advertisement libraries on app ratings [238]. They combined non-technical rating information with the extracted technical information showing advertisement library usage to perform the study. Advertisement libraries query their host server at regular intervals to fetch advertisements for display, and this interval determines the "advertisement fill rate". Multiple libraries are often used to obtain higher fill rates in order to increase revenue. From a sample of 236,245 apps, the authors found no evidence of a correlation between rating and the number of advertisement libraries. However, certain APIs were found to have low median ratings from apps that used them. The authors state that this is due to intrusive behaviours, such as recording entered passwords.

Gorla et al. [95] trained a one-class support vector machine [178] on API usage information in order to identify outliers in trained clusters for security purposes. Bartel et al. [19] showed that off-the-shelf static analysis is insufficient for permission-protected API methods, and investigated alternatives, which they tested on 1,421 apps downloaded from two Android markets. Watanabe et al. [293] found, from analysing the description and API usage of 200,000 Android apps, that there is disparity between their descriptions and requested permissions. This is due to a combination of factors: unnecessary permissions requested by app building frameworks, or developers that use similar manifests for multiple app projects; secondary functionality that is not mentioned in descriptions; and the use of 3rd party libraries. In a related study, Zhou et al. [309] mined a set of 36,561 Android apps, and proposed the tool CredMiner, which is focused on decompilation and program slicing. They identified over 400 apps that leaked developer usernames and passwords required for the program to execute normally.

Wang et al. [289] decompiled 7,923 apps from Google Play and mined features from the decompiled code and variable names. They trained a machine learning classifier on labelled instances of the apps using location and contact information, in order to identify the way in which sensitive information is used. Seneviratne et al. [250] studied 275 free and 234 paid Android apps, and found that paid apps collect personal information in much the same way as free apps do: 60% of the paid apps collected personal information, compared with 85% of free apps. The authors subsequently showed that 20% of 3,605 collected Android apps were connected to more than three trackers.

2.7 Feature Analysis

Papers that extract feature information from either technical or non-technical sources of information are discussed in this subsection, and are summarised in Table 2.5. These research papers study a wide range of platforms: Android, iOS, Nokia WidSets, Blackberry and Windows Phone. The number of apps investigated also varies widely: the minimum is 3 and the maximum is 600,000.

Features have been extracted from app descriptions, API usage, manifest files, decompiled source strings, categories and permissions.

Papers in this section show that it is possible to extract feature information from sources other than source code or requirements lists. Additionally, many different methods are used for the extraction and categorisation of features, including natural language processing, topic modelling and clustering. The work shows that analysis of app collections can be augmented with meaningful technically-oriented information, mined from freely-available app store pages.

Many studies in Section 2.9 are relevant to this subfield of research (in particular the studies discussed in Sections 2.9.3 and 2.9.4, which discuss the extraction of feature requests and causes of complaints). However, those studies are not discussed in this section because they use user reviews to extract this information, whereas the studies discussed in this section do not. Feature Analysis literature is broken down into “Automated Classification”, “Clustering”, “Lifecycles”, “Recommendation”, “Store Success” and “Verification”.

Table 2.5: Chronological summary of feature-related literature [186]. Key: a: Apple App Store; b: Blackberry; g: Google Play or other Android stores; n: Nokia (or WidSets) platform; s: Samsung (Android); w: Windows Phone

Authors [Ref], Year Store Venue No. apps

Shabtai et al. [253], 2010 g CIS 2,285
Chen and Liu [49], 2011 a iConference 102,337
Coulton & Bamford [59], 2011 n MobileHCI 3
Harman et al. [109], 2012 b MSR 32,108
Sanz et al. [242], 2012 g CCNC 820
Teufl et al. [265], 2012 g MobiSec 130,211
Zhu et al. [311], 2012 n CIKM 680
Mokarizadeh et al. [198], 2013 g WEBIST 21,065
Teufl et al. [264], 2013 g Sec. & Com. Netw. 443
Lulu and Kuflik [25], 2013 g IUI 120
Bhattacharya et al. [29], 2013 g CSMR 24
Yin et al. [302], 2013 a WSDM 5,661
Lin et al. [161], 2013 a SIGIR 7,116
Ihm et al. [127], 2013 g CGC 10
Iacob et al. [126], 2013 g BCS-HCI 161
Iacob et al. [125], 2014 g MobiCASE 270
Kim et al. [146], 2014 a Service Business 100,830
Finkelstein et al. [79], 2014 b Tech. report 42,092
Yang et al. [300], 2014 g Tech. report 26,703
Zhu et al. [312], 2014 n TMC 680
Zhu et al. [314], 2014 g KDD 170,753
Jiang et al. [129], 2014 g INTERNETWARE 150
Zhu et al. [310], 2014 a IEEE Cybernetics 15,045
Vakulenko et al. [271], 2014 a ICIS 600,000
Lin et al. [162], 2014 a SIGIR 6,524
Maalej and Nabil [174], 2015 a,g RE 1,140
Guzman et al. [102], 2015 a,g ESEM 7
Guzman et al. [103], 2015 a,g ASE 7
Sarro et al. [247], 2015 b,s RE 54,983
Berardi et al. [27], 2015 a,g SAC 5,993
Svedic [259], 2015 a PhD Thesis 60
Seneviratne et al. [251], 2015 g WWW 232,906
Tong et al. [270], 2015 g,w JCST 10,000
Wang et al. [289], 2015 g UbiComp 7,923
He et al. [110], 2015 g Big Data 122,875
Tian et al. [269], 2015 g ICSME 1,492
Nayebi and Ruhe [207], 2015 g PeerJ C.S. 241
Lulu and Kuflik [26], 2015 g MOB INF SYST 6,633

Mean 52,084
Median 6,579

This section overlaps with Section 2.10 in cases where features are used to help detect anomalies or verify app functionality.

2.7.1 Automated Classification

This subsection concerns the automated classification of apps, using machine learning classifiers trained on feature information. The source of feature information varies between studies.

Shabtai et al. [253] extracted feature information from the manifest, XML files, API calls and methods used from a set of 2,285 Google Play apps. They trained a classifier on the features to differentiate between the Tools and Games categories, as a proof of concept that malware detectors could be trained in the same way. In 2012 Sanz et al. [242] trained machine learning classifiers to predict app categories, using extracted features. The features used for prediction were strings extracted from the decompiled app code, requested permissions, rating, number of ratings and app size. They tested the approach on 820 apps and found a peak AUC (area under ROC curve) of 0.93 using the Bayesian TAN classifier [81].
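To make this kind of classification pipeline concrete, the sketch below trains a classifier on hypothetical per-app metadata and evaluates it with AUC. It is an illustration only: the data are invented, and a random forest stands in for the Bayesian TAN classifier used by Sanz et al., which is not available in common Python libraries.

```python
# Illustrative sketch: category prediction from app metadata, evaluated by AUC.
# The features (permission count, rating, number of ratings, app size) and the
# data are hypothetical; a random forest is substituted for the Bayesian TAN
# classifier purely for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 820
X = np.column_stack([
    rng.integers(0, 30, n),        # number of requested permissions
    rng.uniform(1.0, 5.0, n),      # average rating
    rng.integers(0, 10_000, n),    # number of ratings
    rng.uniform(0.1, 100.0, n),    # app size in MB
])
y = rng.integers(0, 2, n)          # 1 = Games, 0 = Tools (toy labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```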

Zhu et al. [311, 312] studied the problem of mobile app classification in the Nokia Store. The authors mined 680 apps, and experimented with classifying apps using data from web search and from device logs of users of the apps. Their approach outperformed other classification techniques, and enabled them to automatically classify a given app into a predefined category of Apple’s App Store taxonomy. In 2015 Berardi et al. [27] built on this work by constructing a classifier using features mined from app descriptions, categories, names, ratings and file sizes. They trained the classifier using a support vector machine for each of 50 classes, and used the BM25 weighting scheme [235] on the features. Users manually classified 5,993 apps mined from Apple App Store and Google Play, to act as the training (cross-validation) set for the classifier.
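For reference, a standard presentation of the Okapi BM25 weighting scheme, given here only to make the term weighting concrete (the exact variant adopted by Berardi et al. may differ), assigns a term t in a document d the score

\[ \mathrm{BM25}(t,d) = \mathrm{IDF}(t)\cdot\frac{f(t,d)\,(k_1+1)}{f(t,d)+k_1\left(1-b+b\,\frac{|d|}{\mathrm{avgdl}}\right)}, \]

where f(t,d) is the frequency of t in d, |d| is the length of d, avgdl is the average document length in the collection, and k_1 and b are free parameters (commonly k_1 between 1.2 and 2.0, and b = 0.75).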

Jinh et al. [132] used the number of app installs, number of reviews, category and rating score, in conjunction with features based on information flow, to train a machine learning classifier for rating app security risk. Wang et al. [289] extracted features from decompiled Java code, from their collection of 7,923 apps mined from Google Play. They used the extracted features to train classifiers for predicting how ‘location’ and ‘contact’ information is used, with 85% and 94% accuracy, respectively.

2.7.2 Clustering

This subsection concerns studies that perform app clustering using feature information.

Teufl et al. [265] mined 130,211 apps from Google Play and performed clustering on both app descriptions and requested permissions, as part of their activation patterns malware detection approach. They later extended this work [264] to propose a first-step malware detection method using links between description terms and security permissions to identify suspicious outliers. In 2013 Mokarizadeh et al. [198] performed clustering on 21,065 apps, mined from Google Play, after applying topic modelling to app descriptions. They found that the resultant clustering was very different from the apps’ assigned categories, and apps in the same category often had dissimilar description topic distributions. Mokarizadeh et al. also performed correlation analysis and found that users download free apps more frequently, and that downloads correlate with the number of ratings an app has received.
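The pipeline common to these studies, topic modelling over descriptions followed by clustering and a comparison against store categories, can be sketched as follows. The descriptions, categories and parameter choices are toy assumptions, and scikit-learn’s LDA and k-means stand in for the implementations used in the original work.

```python
# Illustrative sketch: topic modelling on app descriptions, clustering of the
# resulting topic distributions, and a comparison of the clusters with the
# store-assigned categories. All data here are toy examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

descriptions = [
    "track your runs and calories with gps",
    "fast photo editor with filters and collage",
    "scan documents to pdf and share",
    "running tracker with heart rate support",
]
store_categories = ["health", "photo", "productivity", "health"]

counts = CountVectorizer(stop_words="english").fit_transform(descriptions)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)      # per-app topic distributions

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(doc_topic)

# Agreement between description-based clusters and store categories
# (Mokarizadeh et al. report that the two differ substantially).
cat_to_id = {c: i for i, c in enumerate(sorted(set(store_categories)))}
true_labels = [cat_to_id[c] for c in store_categories]
print(adjusted_rand_score(true_labels, clusters))
```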

Lulu and Kuflik [25] performed clustering on 120 apps mined from Google Play, comparing their description-based clustering with the app store’s pre-assigned categories. They found that descriptions provide good clustering features, and presented the method as the basis of an app recommendation system. The authors later built on this work [26], by extracting features from 6,633 app descriptions and enriching them with information mined from the web, found by searching for the app name. They used the enriched features to provide an installed-app recall interface, supported by functionality-based categorisation. The interface was validated by performing a user study with 40 participants, who were able to find apps faster and found the categorisation more intuitive, when compared with a reference “smart launcher” interface [88]. Kim et al. [146] mined 100,830 apps from Apple App Store, and extracted feature keywords from their descriptions using natural language processing. They clustered apps using the extracted features, and re-categorised them using the resulting clusters.

2.7.3 Lifecycles

This subsection concerns the tracking of features between multiple apps.

Sarro et al. [247] proposed a theoretical characterisation of feature lifecycles in app stores, to help app developers to identify trends and to find undiscovered requirements. In order to investigate app feature migratory and non-migratory behaviours in current app stores, they mined features from app descriptions using the techniques in the earlier work [109], and used the proposed theory to empirically analyse the migratory and non-migratory behaviours of 4,053 non-free features from the Samsung and Blackberry stores. The results revealed that features generally migrate to a category that has similar characteristics. However, there are also a few features that migrate to apparently non-related categories. The early identification of these features may allow developers to find undiscovered requirements. The authors also found that approximately one third of features were intransitive (they neither migrate nor die out over the period studied), and such features exhibited significantly different behaviours with regard to important properties, such as their price. Being aware of which are the intransitive features in a given category may support developers in identifying crucial (‘must-have’) requirements for their apps.

2.7.4 Recommendation

This subsection concerns recommender systems that are trained using feature information.

Yin et al. [302] proposed the Actual Tempting (AT) model to perform app recommendation for users. The model incorporates latent tempting parameters. Take for example two apps, “a” and “b”. The AT model incorporates the number of users who own app “a” and subsequently download app “b”, and the number who do not download “b” after owning “a”. The model also uses feature overlap information, measured by performing topic modelling on app descriptions and computing the topic overlap between each pair. The authors found that the AT approach outperformed collaborative filtering and case-based reasoning in their initial experiments.

Lin et al. [161] used topic modelling on the sets of users that follow an app’s Twitter feed, in order to generate latent groups related to the app. The groups are then used as part of a recommendation system, in order to help remove the problem of cold start in app recommendation based on other metadata. The system was tested on 7,116 apps mined from Apple App Store and the authors found that it outperformed recommendation using app descriptions. In 2014 Lin et al. [162] instead used topic modelling on app descriptions in order to produce a recommendation system. The model is semi-supervised and incorporates app version information using different weights corresponding to update types, so that newer app versions can be recommended when they add a certain feature to the description. Resultant topics are weighted based on their category in the app store to provide a recommendation. The model was trained on 6,524 apps mined from the Apple App Store.
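The feature-overlap component used by Yin et al., comparing the description topic distributions of two apps, can be illustrated with the short sketch below. The topic distributions are invented, and cosine similarity is chosen purely for illustration; the AT model’s exact overlap measure may differ.

```python
# Sketch of the feature-overlap idea: given two apps' topic distributions
# (e.g. from LDA over their descriptions), compute an overlap score.
# Cosine similarity is an illustrative choice, not the AT model's measure.
import numpy as np

def topic_overlap(theta_a: np.ndarray, theta_b: np.ndarray) -> float:
    """Cosine similarity between two topic distributions."""
    return float(np.dot(theta_a, theta_b) /
                 (np.linalg.norm(theta_a) * np.linalg.norm(theta_b)))

theta_app_a = np.array([0.70, 0.20, 0.10])   # hypothetical distributions
theta_app_b = np.array([0.60, 0.30, 0.10])
print(topic_overlap(theta_app_a, theta_app_b))
```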

Zhu et al. [310] mined the daily top 300 free and top 300 paid apps from the Apple App Store charts from February 2, 2010 to September 17, 2012, collecting information on 15,045 apps in total. They used popularity information to construct a Popularity-based Hidden Markov Model (PHMM), to encode trend and other latent factors. The authors state that this can be used in a variety of ways, including app recommendation and review spam detection, and demonstrate its usefulness in ranking fraud detection. Zhu et al. [314] built an app recommendation system using a combination of technical information (device permissions requested) and non-technical information (app popularity). They tested the system on 170,753 apps mined from Google Play to show its scalability. However, the system received no human-based evaluation of its recommendations.

Vakulenko et al. [271] performed topic modelling on a set of 600,000 app descriptions mined from Apple App Store. They used the resultant topics to suggest categories, and to improve and augment existing categorisation approaches used in app stores. He et al. [110] trained a system for targeting users for advertising, with a dataset containing app install data on a per-user basis, consisting of 122,875 apps from the Huawei App Store. The authors reported a higher click rate than current targeting approaches. Nayebi and Ruhe [207] extracted feature information from 241 Google Play apps, and used crowd-sourcing to assign user value to each of the features. The authors used the approach for service portfolio planning [3].

2.7.5 Store Success

This subsection concerns studies that analyse the success of apps in their hosting app store. Success is measured by rating, the total number of ratings or rating frequency, downloads or download rank, and bug reports.

In 2011, Coulton and Bamford [59] conducted a case study on games created for the WidSets platform, an earlier app store that targeted Nokia phones (including non-smartphones). Their findings are transferable to modern app stores: high download numbers are required in order to gain active users, and popular features such as chat can increase popularity and the proportion of active users. Chen and Liu [49] collected 102,337 apps from Apple App Store, and observed no correlation between download rank and rating, from a sample of the top 200 most popular apps.

Harman et al. [109] introduced app store mining as an MSR (Mining Software Repositories) problem. They mined app information and performed correlation analysis on price, downloads and rating. Correlation analysis was performed in both app and feature space, where features were extracted from app descriptions using natural language processing techniques, and results showed that under most conditions there is a strong correlation between rating and downloads (popularity). The proposed approach can be applied to different app stores by modifying the data extraction and parsing phases to accommodate the different app store structure and data representations. The authors later extended this work [79], finding that free apps have higher ratings than non-free apps, with a medium effect size. They also carried out a developer survey on the extracted features: developers found them meaningful, and were able to successfully distinguish the extracted features from randomly generated features.
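The kind of correlation analysis described above can be sketched in a few lines; the data below are hypothetical, and Spearman’s rank correlation is used as one plausible choice of non-parametric measure.

```python
# Sketch of app-space correlation analysis between price, rating and
# downloads. The values are hypothetical.
from scipy.stats import spearmanr

price     = [0.00, 1.99, 0.00, 4.99, 0.99, 0.00]
rating    = [4.5,  3.8,  4.1,  3.2,  4.0,  4.6]
downloads = [500_000, 20_000, 120_000, 5_000, 40_000, 800_000]

rho, p_value = spearmanr(rating, downloads)
print(f"rating vs downloads: rho={rho:.2f}, p={p_value:.3f}")

rho, p_value = spearmanr(price, rating)
print(f"price vs rating:     rho={rho:.2f}, p={p_value:.3f}")
```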

Bhattacharya et al. [29] presented an empirical study of 24 open source Android apps from multiple categories, with the aim of defining metrics of bug report quality and developer involvement. The authors showed how the bug-fix process is affected by differences in bug lifecycles. Security bug reports were found to be of higher quality, but the associated bugs are fixed more slowly. The scale of the study was large as all apps had more than 1,000 ratings, 100,000 downloads and 200 bug reports. The authors found that bug report quality correlates with description length but not app rating.

Ihm et al. [127] conducted a study on 10 popular apps in the Google Play store, analysing the correlations between app downloads in the store and external metrics. The authors found a strong positive correlation between the number of downloads in the store and the number of registered users on the app’s respective websites, and a strong correlation between the number of downloads and the app website (inverse) download rank. Jiang et al. [129] conducted a user survey on 50 app descriptions in order to identify the attributes most important to the quality of the description. A support vector machine was trained on the resultant attributes and tested on a sample of 100 descriptions, finding an accuracy of 0.62. The findings showed that quick overviews are the most effective form of app description, and the study contains further heuristics on good description styles.

In a longitudinal study on 60 paid iOS apps, Svedic [259] found that ratings and reviews can impact sales ranks. The study found that higher, more stable ratings led users to associate the app with high quality, and app sales increased as a result. Tian et al. [269] studied 1,492 high and low rated apps from Google Play, and identified the features that most accurately differentiate apps with high ratings from those with low ratings. The authors used technical features, such as code complexity and API usage, with non-technical information such as the category and the number of images displayed on the app store page. The most important features for differentiating high from low rated apps were the size of the app and the number of images on the store page. The target SDK version was also an influential feature, suggesting that high rated apps are updated more frequently and use more modern features of the Android operating system.

2.7.6 Verification

This subsection concerns the verification of claimed features in app descriptions, matched against the behaviour or API calls of the app software.

Yang et al. [300] introduced the APPIC framework, which extracts main theme tag words from Android app descriptions and permission files. It does this using LDA and Partially Labelled Dirichlet Allocation (PLDA), for the purpose of identifying misleading app descriptions. It uses an app’s permissions file to establish whether its description makes claims consistent with its functionality, and whether it resides in the appropriate category. The method was tested on 207,865 apps from Google Play, and was manually evaluated on a subset of 1,000 apps. The authors found that their method achieved (average) 88.1% category accuracy, and 76.5% permissions accuracy. Watanabe et al. [293] found that apps often contain secondary functionality that is not mentioned in their descriptions.

In a study of 232,906 apps, Seneviratne et al. [251] trained a machine learning classifier on app features in order to detect spam apps. The features used for the classifier were numeric statistics about an app’s description. The authors labelled apps that were removed from the store and established potential reasons for removal. Apps that were likely to have been removed due to being spam (the majority of those removed) were then used to train a boosting classifier in order to identify potential spam.

Tong et al. [270] proposed the App Generative Model (AGM) topic model, for extracting semantically coherent app features from descriptions, using term co-occurrence statistics. The AGM model resulted in lower perplexity [286] (a fitness function for probabilistic models, which measures the trained model’s log-likelihood of generating a held-out test set; the lower the better) than the most commonly used model, LDA. However, the model precision was evaluated only against TF.IDF, and not LDA or similar topic models such as the weighted topic model [194]. Nevertheless, the study shows the importance of accurate feature discovery and representation, and can help lead to future studies using extracted features.
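For reference, held-out perplexity is commonly defined as follows (this is the standard formulation for topic models, rather than a variant specific to the AGM evaluation):

\[ \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left\{-\,\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right\}, \]

where D_test contains M held-out documents, \(\mathbf{w}_d\) denotes the words of document d, N_d its length, and p(\(\mathbf{w}_d\)) the likelihood the trained model assigns to it; lower perplexity indicates better generalisation.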

2.8 Release Engineering

This section discusses papers that focus on app releases or release strategies, which are summarised in Table 2.6. The results in Table 2.6 show that there were two papers published in 2011 that tackled this issue, one in 2013, and then a steady increase prior to November 27, 2015. Release studies typically require time series data, in order that the changes made to apps in their releases can be recorded. Due to the 2015 spike in release engineering studies, we expect the trend to continue and contribute to the growing body of App Store Analysis literature. As can be seen in Table 2.6, the stores studied are split almost equally between Apple and Google, but there are no release studies in the Blackberry or Windows Phone stores.

Table 2.6: Chronological summary of release engineering-related literature [186]. Key: a: Apple App Store; g: Google Play or other Android stores

Authors [Ref], Year Store Venue No. apps

Lee and Raghu [151], 2011 a AMCIS 3,168
Henze and Boll [113], 2011 a MobileHCI 24,647
Datta and Kajanan [67], 2013 a CloudCom-Asia 3,535
Lee and Raghu [152], 2014 a JMIS 7,579
Ruiz et al. [239], 2014 g IEEE Soft. 120,981
Guerrouj et al. [100], 2015 g SANER 154
Comino et al. [55], 2015 a,g Tech. report 1,000
McIlroy et al. [191], 2015 g ESE 10,713
Gui et al. [101], 2015 g ICSE 21
Carbunar and Potharaju [41], 2015 g ASONAM 160,000
Alharbi and Yeh [6], 2015 g MobileHCI 24,436

Mean 29,772
Median 5,557

The scale of the past studies in this section is relatively small, ranging from 21 to 160,000; this scale is not surprising, given the difficulty of mining longitudinal data for a large number of data points.

Release Engineering literature has featured the Apple and Google platforms but not yet Blackberry, Samsung or Windows. The scale of studies has been small, most likely due to the difficulty of obtaining time series data.

Release Engineering literature is broken down into “Content”, “Success” and “Strategy” subsections.

2.8.1 Content

This subsection concerns studies into the content of app releases. The 2014 study by Ruiz et al. investigated the updates that apps made to their advertisement libraries [239]. They found that over 12 months, almost half of the 5,937 apps with multiple updates had an advertisement library update. Approximately 14% of advertisement updates contained no changes to the app’s code, indicating the effort involved in keeping advertisement libraries updated. Gui et al. found, from 21 apps in Google Play with frequent releases, that 23% of their releases contained ad-related changes [101].

The findings of Guerrouj et al. [100] indicate that high code churn in releases correlates with lower ratings. Alharbi and Yeh [6] tracked the design patterns used by 24,436 Android apps over a period of 18 months. They found that deprecated patterns are sometimes adopted after they are deprecated, and that new pattern adoption rates are low. By tracking the app descriptions, they found that developers sometimes update their descriptions to reflect changes to the design patterns they use. They believe that this shows that descriptions are used as a communication channel between developers and users. The authors report on apps that start and stop using certain design patterns. An interesting future research direction might be to record the migration of these “design features” using the app feature migration terminology of Sarro et al. [247].

2.8.2 Success

This subsection concerns studies into the relationship between app releases and the store success of apps, as measured by rating, downloads or popularity.

Moller et al. [204] studied the installation behaviour of users with recently updated apps, in a security-related study. Lee and Raghu [152] studied the factors that affect an app’s likelihood of staying in the top (most popular) charts in the Apple App Store. They found that free apps were more likely to ‘survive’ in the top charts, and that frequent feature updates were the most important factor in ensuring their survival, along with releasing in smaller categories. The authors also found that high volumes of positive reviews improved an app’s likelihood of survival.

Carbunar and Potharaju [41] conducted a longitudinal study on 160,000 Google Play apps mined daily over a 6 month period in 2012. They found that at most 50% of apps were updated in each category, and that there is an issue of “stale apps” affecting aggregated statistics on large populations. The authors also found that a few developers dominated the total download counts, that productive developers did not have many popular apps, and that there was no correlation between price and downloads.

2.8.3 Strategy

This subsection concerns releasing strategies for app deployment and updates.

In 2011 Lee and Raghu [151] mined app information from the top 300 iOS apps in all 21 categories, both free and paid, covering at least 3,168 apps. They analysed developer diversification through publishing apps in multiple categories and in both free and paid form, and found a positive relationship between download rank and app portfolio diversification. The study combines download rank with non-technical information (category, price), in order to identify actionable findings for app developers.

Henze and Boll [113] analysed release times and user activity in the Apple App Store, and concluded that Sunday evening is the best time for deploying games. Their study also found that version updates were an effective strategy for raising an app’s rank in the store. Datta and Kajanan [67] studied review counts from the Apple App Store, and found that apps receive more reviews after deploying updates on Thursdays.

In 2014 Lin et al. [162] incorporated version information in their app recommendation system, in order to ensure that apps are recommended when they add new features in new versions. Comino et al. [55] studied the top 1,000 apps in Apple App Store and Google Play. They found that, for iTunes, app releases are more frequent when the app is performing badly, and that releases can boost downloads. Neither finding held true for Google Play, however, which the authors speculate might be a symptom of “excessive updating” caused by a lack of quality control from Google.

McIlroy et al. [191] studied update frequencies in the Google Play store, after mining data about 10,713 mobile apps. They found that only 1% of the studied apps received more than one update per week, and only 14% were updated within a two-week period. The authors also found that rating was not affected by update frequency. Nayebi and Ruhe [207] combined app features with values gained from crowd-sourcing as an approach to app service portfolio planning.

Table 2.7: Chronological summary of reviews-related literature [186]. Key: a: Apple App Store; b: Blackberry; g: Google Play or other Android stores

Authors [Ref], Year Store Venue No. apps

Hoon et al. [120], 2012 a OzCHI 17,330
Vasa et al. [274], 2012 a OzCHI 17,330
Chandy and Gu [47], 2012 a WebQuality 3,090
Goul et al. [96], 2012 a HICSS 9
Ha et al. [105], 2013 g CCNC 59
Oh et al. [212], 2013 g CHI 24,000
Hoon et al. [119], 2013 a Tech. report 17,330
Iacob and Harrison [124], 2013 g MSR 270
Galvis Carreño et al. [83], 2013 g ICSE 3
Khalid [138], 2013 a ICSE 20
Fu et al. [82], 2013 g KDD 171,493
Chen et al. [51], 2013 a,g WWW 5,059
Pagano and Maalej [214], 2013 a RE 1,100
Hoon et al. [118], 2013 a OzCHI 25
Iacob et al. [126], 2013 g BCS-HCI 161
Iacob et al. [125], 2014 g MobiCASE 270
Khalid [140], 2014 a IEEE Soft. 20
Chen et al. [50], 2014 g ICSE 4
Cen et al. [45], 2014 g PIR 6,938
Guzman and Maalej [104], 2014 a,g RE 7
Khalid et al. [139], 2014 g FSE 99
Wano and Iio [292], 2014 a NBIS 500
Erić et al. [75], 2014 a QIP 968
Khalid et al. [141], 2015 g IJITCS 0
Gao et al. [84], 2015 g SOSE 4
McIlroy et al. [192], 2015 a,g ESE 12,000
Cen et al. [44], 2015 g SIAM 12,783
Vu et al. [284], 2015 g ASE 3
Vu et al. [283], 2015 g CoRR 95
Malavolta et al. [176], 2015 g MS 11,917
Malavolta et al. [177], 2015 g MOBILESoft 11,917
Park et al. [219], 2015 g SIGIR 43,041
Panichella et al. [218], 2015 a,g ICSME 7
Palomba et al. [215], 2015 g ICSME 100
Moran et al. [201], 2015 g FSE 14
Gomez et al. [92], 2015 g MOBILESoft 46,644
Maalej and Nabil [174], 2015 a,g RE 1,140
Pérez [278], 2015 g Masters Thesis 4
Khalid et al. [142], 2015 - IJIEEB 0
Gu and Kim [99], 2015 g ASE 17
Guzman et al. [102], 2015 a,g ESEM 7
Guzman et al. [103], 2015 a,g ASE 7
McIlroy et al. [193], 2015 g IEEE Soft. 10,713
Liang et al. [154], 2015 a IJEC 139

Mean 9,594
Median 161

2.9 Review Analysis

Literature discussed in this section concerns the study of app reviews; a summary of discussed literature can be found in Table 2.7. The results in Table 2.7 show that the majority of studies focus on the Google Play store, with a minority focusing on Apple App Store, and one paper studying the Blackberry store. App review centred literature first emerged in 2012, and has subsequently gained significant (and increasing) interest and activity: Figure 2.3 shows that there are greater numbers of requirements/reviews literature each year. We hypothesise that this is due to the tenure of the stores, and the progression of the field. One avenue of research that has not been attempted is the study of reviews in the Windows Phone Store, which was launched in 2010 but has not achieved the widespread success of Google Play and Apple App Store in this competitive market.

Review Analysis literature mostly studies Apple and Google stores, inviting future comparison with Windows and other store reviews.

Review Analysis literature is broken down into “Classification”, “Content”, “Requirements Engineering”, “Sentiment”, “Summarisation” and “Surveys and Methodological Aspects of App Store Analysis”. Many early works focused on the content of reviews in 2012–2013, before advancing to sentiment in 2013–2014, and to requirements and summarisation in 2015.

2.9.1 Classification

This subsection concerns both the classification of user reviews and studies that have trained classifiers on user reviews. Chandy and Gu [47] mined 6,319,661 reviews from 3,090 apps in the Apple App Store. After manually labelling a subset of the mined reviews as spam or not spam, the authors trained both a supervised decision tree and unsupervised latent class analysis to identify spam reviews. The unsupervised method achieved higher accuracy, and took into account factors such as the average rating of a user, and the number of apps rated. Chen et al. [51] compared the maturity ratings of 1,464 equivalent apps between the Apple App Store and Google Play; taking the iOS maturity ratings as the accurate ratings, the authors found that 9.7% of the Android apps were underrated and 18.1% were overrated. The authors also studied a sample of 729,128 reviews from 5,059 Google Play game apps, and trained a classifier on the sets of app descriptions, user reviews, and iOS maturity ratings, to automatically verify app maturity ratings. Ha et al. [105] manually examined 556 reviews mined from 59 Google Play apps, in order to classify them into topics and sub-topics based on content. They found that most information in reviews concerns the quality of the app, and not security or privacy concerns.

Cen et al. [45] devised an approach to identify Comments with Security / Privacy Issues (CSPI) from a set of mined Google Play app reviews. The authors later built upon this work, using reviews in order to rank the security risk of apps, by detecting security labels in a crowd-sourcing approach [44]. Using AndroGuard [9] scores as a ground truth, the authors found that their tool outperformed other metrics for ranking app security risk, half of which incorporated user reviews and half of which relied on declared permissions.

Gomez et al. [92] used an unsupervised machine learning approach in order to identify apps that may contain errors, using 1,402,717 reviews mined from 46,644 apps. The authors used the error information, in addition to the permissions used by the apps, to construct a ranked recommender system that analyses app permissions for app store moderators. Guzman et al. [103] also developed an ensemble of machine learning classifiers in order to classify user reviews. They tested this system on 4,550 reviews mined from 7 apps in the Google and Apple app stores, and achieved a precision of 0.74 and recall of 0.59 on a manually labelled set of 1,820 reviews.

2.9.2 Content

This subsection concerns studies into the content of user reviews, and how store success metrics are associated with certain review content.

Hoon et al. [120] and Vasa et al. [274] collected a dataset containing 8.7 million reviews from the Apple App Store and analysed the reviews and vocabulary used. In 2013 Hoon et al. analysed 8 million reviews from Apple App Store [119]. They found that the majority of mobile app reviews are short in length, and that rating and category influence the length of reviews. The majority of studied apps received under 50 reviews in their first year. Half of the apps analysed decreased over time in user-assessed quality, as denoted by rating. The authors suggested that user expectations of apps are changing rapidly, and that developers must keep up with demand to remain competitive.

Iacob et al. [126] studied how the price and rating of an app influence the type and amount of user feedback that it receives through reviews. The authors selected 3,279 reviews for the study, from which they identified 9 classes of feedback: positive, negative, comparative, price related, request for requirements, issue reporting, usability, customer support, and versioning. From the selected apps, there was a roughly equal split of positive type reviews with feature/issue type reviews, with very few other types such as negative or price related. Khalid et al. [139] studied the devices used to submit app reviews, in order to determine the optimal devices for testing.

Palomba et al. [215] studied the Google Play reviews from 100 open source Android apps, and linked the reviews to code changes. They found that a mean of 49% of review requests are implemented in new releases, and that the apps whose changes more directly implement the content of user reviews improve their ratings with new releases. In order to bridge the gap between software attributes and user reviews, Hoon et al. [117] developed an ontology of words used to describe software quality attributes in app reviews. McIlroy et al. [193] studied responses to reviews from 10,713 Google Play apps, finding that most developers do not respond to reviews. However, in the cases where a response occurred, 38.7% of users were found to subsequently change their ratings, resulting in a median increase in individual user ratings of 20%. A summary of mobile app user feedback classification can be found in the study by Maalej et al. [285].

2.9.3 Requirements Engineering

This subsection concerns the extraction from user reviews of bug reports, user requests and other information that can be used for requirements engineering. The extracted information of course relates not only to the reviews themselves, but to the app as a whole. Oh et al. [212] developed a review digest system which they tested on 1,711,556 reviews mined from 24,000 Google Play apps. They automatically categorised reviews into bug reports, functional requests and non-functional requests, and produced a digest featuring the most informative reviews in each category.

Iacob and Harrison [124] presented an automated system (MARA) for extracting and analysing app reviews in order to identify feature requests. The system is particularly useful because it offers a simple and intuitive approach to identifying requests. The authors used 161 apps and 3,279 reviews to manually train linguistic rules. Additionally, 136,998 reviews were used for an evaluation, which found that 23.3% of reviews contain feature requests. As an extension to the MARA system they had previously introduced [126], Iacob et al. [125] introduced a set of linguistic rules for identifying feature requests and bug reports in order to help facilitate app development.

Wano and Iio [292] analysed the textual content of 856 reviews from 500 apps in the Japanese App Store, and found that review styles differed between apps in different categories. In a large scale study, Erić et al. [75] studied the star ratings of 48 million reviews from 968 popular free and paid Apple apps. They found that the reviews were mostly positive, and that there were significant differences in the distributions between categories, and also between free and paid apps. Free apps were found to have more reviews but a lower mean rating, and a higher standard deviation. Due to the higher numbers of reviews for free apps, which may give an app credibility, the authors argue that in-app purchasing revenue models are a good way for developers to make money, especially if used as a ‘teaser’ for a paid version.

Park et al. [219] developed AppLDA, a topic model designed for use on app descriptions and user reviews, that discards review-only topics. This enables developers to inspect the reviews that discuss features present in the app descriptions. The authors tested the system on 1,385,607 reviews mined from 43,041 apps. Panichella et al. [218] presented a system for automatically classifying user reviews based on a predetermined taxonomy, in order to support software maintenance and requirements evolution. They verified the system on a manually labelled truth set of 1,421 sentences extracted from reviews, and achieved 0.85 precision and 0.85 recall when training the system on language structure, content and sentiment features. Maalej and Nabil [174] produced a classification method identifying bug reports and feature requests from user reviews. The authors found that upwards of 70% precision and 80% recall can be obtained using multiple binary classifiers, as an alternative to a single multiclass classifier. They also found that the commonly used NLP techniques of stopword removal and lemmatisation can negatively affect the performance of this classification task.
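The ‘one binary classifier per label’ setup can be illustrated with the short sketch below. The reviews and labels are toy data, and a TF-IDF plus logistic regression pipeline is used purely for illustration; Maalej and Nabil evaluated several different algorithms and feature combinations.

```python
# Sketch of classifying reviews with one binary classifier per label
# ("bug report", "feature request"), as an alternative to a single
# multiclass classifier. Reviews and labels are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "app crashes every time i open the camera",
    "please add a dark mode option",
    "love it, works great",
    "would be nice to export data to csv",
    "keeps freezing after the last update",
    "five stars, no complaints",
]
is_bug     = [1, 0, 0, 0, 1, 0]
is_request = [0, 1, 0, 1, 0, 0]

classifiers = {}
for label_name, labels in [("bug report", is_bug), ("feature request", is_request)]:
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    classifiers[label_name] = clf.fit(reviews, labels)

new_review = ["the app crashes when i try to add a new entry"]
for label_name, clf in classifiers.items():
    print(label_name, clf.predict(new_review)[0])
```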

Moran et al. [201] proposed the FUSION system, which performs static and dynamic analysis on Android apps in order to help users complete bug reports. The system focuses on the steps to reproduce a bug, using dynamic analysis to walk through Android system events. Khalid et al. [142] argue that app store reviews can be used for “crowdsourcing” [180]. They argue that users are inadvertently performing crowdsourcing when they review apps, solving the following problems: requests for potential features, suggestions for developer action, recommendations for other users, and issue reporting.

2.9.4 Sentiment

The works discussed in this subsection have incorporated sentiment in their study of reviews. Sentiment describes a user’s views or opinions, typically treated as positive or negative in the following studies, and is extracted from reviews using ‘positive’ sentiment words such as ‘good’, ‘great’ and ‘love’, or ‘negative’ sentiment words such as ‘bad’, ‘hate’ and ‘terrible’.
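A minimal sketch of this kind of lexicon-based scoring is shown below; the word lists are tiny illustrative examples, whereas the studies in this subsection use substantially larger lexicons or trained sentiment models.

```python
# Minimal sketch of lexicon-based sentiment scoring for a review.
# The word lists are tiny illustrative examples only.
POSITIVE = {"good", "great", "love", "excellent", "awesome"}
NEGATIVE = {"bad", "hate", "terrible", "awful", "broken"}

def sentiment_score(review: str) -> int:
    """Positive score => broadly positive review; negative => negative."""
    words = review.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("love the new update, great design"))   # prints  2
print(sentiment_score("terrible app, i hate the ads"))         # prints -2
```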

In 2012 Goul et al. [96] published the earliest work to study online app store reviews. The authors performed sentiment analysis on 5,000 Apple App Store reviews in order to facilitate requirements engineering. Galvis Carreño and Winbladh [83] extracted user requirements from comments using the ASUM model [133], a sentiment-aware topic model. Initial results showed that the method aids requirements summarisation with significantly less effort than manual identification, but does not return all possible requirements. Hoon et al. [118] gathered a set of 29,182 short reviews of up to 5 words, from the top 25 Health & Fitness apps in the Apple App Store. They analysed the reviews and found that they are mostly made up of sentiment words, and match the star rating of the review closely.

Khalid [138, 140] manually categorised 6,390 negative reviews from a sample of 20 free iOS apps, and reported the most frequent causes of complaint. The apps had over 250,000 reviews combined, and so 6,390 reviews is a statistically representative sample at the 95% confidence level. The manual analysis of the 6,390 reviews found that 11% of samples concerned complaints about a recent update. Users were most dissatisfied by issues relating to invasion of privacy and unethical behaviour, while hidden cost was the second most negatively perceived complaint. Pagano and Maalej [214] gathered a sample of 1.1 million reviews from the Apple App Store in order to provide an empirical summary of user reviewing behaviour. They found that most feedback is provided after releases, that positive feedback is often associated with highly downloaded apps, and that negative feedback is often associated with less downloaded apps and often does not contain user experience or contextual information.

In 2014 Chen et al. [50] produced a system for extracting the most informative reviews, placing weight on negative sentiment reviews. Guzman and Maalej [104] studied user sentiments towards app features from a multi-store sample, which also distinguished differences between user sentiments in Google Play and the Apple App Store. Guzman et al. [102] developed a tool called DIVERSE, which extracts key reviews specific to a queried feature. DIVERSE groups together reviews with similar sentiments about the same feature in order to condense the information. The authors tested their tool on the dataset used in their earlier study [104]. Liang et al. [154] performed MultiFacet Sentiment Analysis (MFSA) on user reviews from 139 apps mined from Apple App Store. They reported that opinions on product quality form a larger portion of reviews, but opinions on service quality have a bigger effect on sales.

2.9.5 Summarisation

This subsection concerns the summarisation of large samples of user reviews, using automated tools.

A large sample was used in the 2013 study by Fu et al. [82], in which the authors analysed over 13 million Google Play reviews for summarisation. They designed a system called WisCom that enables summarisation of reviews at a per-review, per-app or per-market level. This tool can be useful for large-scale overviews of competitor apps, or for gathering information about a market. The weakness of the system is the need for a large complete sample of reviews to be mined first, and the associated mining difficulties. However, the WisCom system enables summarisation of ‘complaint’ or ‘praise’ reviews over time, and can produce accurate results given a complete sample in a fixed time period (e.g. 6 months), so long as the inherent sample bias is taken into account. The authors found that there was a large difference between free and paid apps, and that paid apps have an associated ‘complaint’ type about price that free apps do not.

In 2015 McIlroy et al. [192] studied reviews in Google Play and Apple App Store, and developed an automated labelling scheme that can identify multiple elements of reviews that could be beneficial to stakeholders. For example, a review might contain a feature request and a bug report, and so a label for each type would be applied to it. Gao et al. [84] proposed AR-Tracker, a tool similar to AR-Miner [50], that automatically collects user reviews of apps and ranks them in order to optimise the representation of the review set, in terms of frequency and importance. Pérez [278] mined and labelled 160 user reviews from 5 Google Play apps in order to train a review categorisation tool that identifies feature requests and bug reports. The tool was evaluated on 400 labelled reviews and achieved 0.78 accuracy.

Malavolta et al. [176, 177] analysed 3 million reviews from 11,917 Google Play apps, and produced a summary of user perceptions of 445 hybrid apps [112] compared with native apps. The authors found that hybrid mobile apps receive similar ratings to native apps, but that native apps have been reviewed on average 6.5 times more. They plan to replicate the work using multiple stores and a small set of cross-platform apps to compare their perception across different platforms. Vu et al. [283, 284] developed MARK, a system that identifies keywords in sets of reviews in order to assist summarisation and search. The method is one of several summarisation approaches that have been applied to reviews.

Gu and Kim [99] proposed SUR-Miner, a review summarisation and categorisation tool, which they evaluated on 2,000 sentences from reviews of 17 Google Play apps. The tool is intended for use by developers, and produces a visualisation of the reviews. The authors surveyed the developers of the studied apps, of whom 28 out of 32 agreed that the tool is useful. In the Google Play store it is possible for developers to respond to reviews, which can lead to users changing their rating.

2.9.6 Surveys and Methodological Aspects of App Store Analysis

Khalid et al. [141] reviewed recent literature in app store review analysis, and made several suggestions that could improve the app reviewing process for both users and developers. They suggested assigning categories to reviews, and adding sorting and filtering functionality based on the assigned category, helpfulness and star rating. The authors also suggested that adding a user reply feature would help developers obtain the highest quality reviews.

2.10 Security

Studies relating to app security are discussed in this section, and are summarised in Table 2.8. The results in Table 2.8 show that the number of studies grew year on year until 2013 and then remained stable. A large proportion of these papers do not combine technical with non-technical attributes. Instead, they use properties such as the implicit validation that highly rated apps have received, through being downloaded, used and highly rated by many users. Much of the dedicated security research in this section used the Google Play store: there are no studies using Blackberry, few using Apple, and just one using Windows. Much of the security-related literature used the property that popular Google Play apps can generally be assumed to be non-malware, since they are scanned prior to being hosted in the store (by Google Bouncer [5]), and they have large user bases. Many studies in this section used large collections (>10,000 apps) of benign apps to help distinguish between benign and malicious behaviour.

The number of apps used ranges from 1 to 1,165,847 (which also happens to be the largest study in this survey).

App security has been well-studied in the literature, and perhaps warrants a survey of its own. There are many studies on mobile app security that used app stores in a less direct way than those discussed in this section, some of which are mentioned in Appendix A.2. Additionally, literature in Section 2.6.4 has identified potential risks associated with permissions, beyond the more direct security threats discussed in this section.

Table 2.8: Chronological summary of security-related literature [186]. Key: a: Apple App Store; g: Google Play or other Android stores; w: Windows Phone

Authors [Ref], Year Store Venue No. apps

Bläsing et al. [31], 2010 g MALWARE 150
Batyuk et al. [21], 2011 g MALWARE 1,865
Potharaju et al. [226], 2012 g ESSoS 158,000
Moller et al. [204], 2012 g LARGE 1
Chia et al. [52], 2012 g WWW 19,344
Gibler et al. [86], 2012 g TRUST 24,350
Grace et al. [97], 2012 g WiSec 100,000
Crussell et al. [63], 2012 g ESORICS 75,000
Peng et al. [221], 2012 g CCS 500,000
Zhu et al. [316], 2015 g ICICS 5,685
Bakar et al. [15], 2013 g ACSAT 5,000
Stevens et al. [258], 2013 g MSR 10,300
Book et al. [34], 2013 g CoRR 114,000
Sanz et al. [243], 2013 g Cybernet. Syst. 333
Sanz et al. [245], 2013 g SECRYPT 333
Sanz et al. [244], 2013 g NSS 333
Wang et al. [291], 2013 g DBSec 272,774
Crussell et al. [64], 2013 g ESORICS 265,359
Gibler et al. [87], 2013 g MobiSys 265,359
Peiravian and Xingquan [220], 2013 g ICTAI 1,250
Chakradeo et al. [46], 2013 g WiSec 14,888
Pandita et al. [216], 2013 g SEC 581
Zhu et al. [313], 2013 a CIKM 15,045
Liu et al. [170], 2014 w NSDI 51,150
Crussell et al. [65], 2014 g MobiSys 165,426
Gorla et al. [95], 2014 g ICSE 32,136
Ham and Lee [106], 2014 g IJCCE 10
Bhoraskar et al. [30], 2014 g SEC 1,010
Qu et al. [231], 2014 g CCS 45,811
Zhu et al. [315], 2015 a TKDE 15,045
Schütte et al. [248], 2015 g ConDroid 10,000
Mutchler et al. [203], 2015 g MoST 998,286
Avdiienko et al. [14], 2015 g ICSE 2,866
Ma et al. [173], 2015 g COMPSAC 22,555
Vigneri et al. [277], 2015 g CoRR 5,000
Yang et al. [299], 2015 g ICSE 633
Lageman et al. [150], 2015 g MILCOM 417
Deng et al. [70], 2015 a CCS 2,019
Zhang et al. [305], 2015 g CCS 100
Huang et al. [122], 2015 g SEC 16,000
Chen et al. [48], 2015 g SEC 1,165,847

Mean 106,933
Median 14,888

Security literature is broken down into “Faults”, “Malware”, “Permissions”, “Plagiarism”, “Privacy” and “Vulnerability” subsections.

2.10.1 Faults

This subsection concerns studies that detect faults in Android and Windows apps, which can be potential security concerns. Ravindranath et al. [233] used a sample of apps mined from the Windows Phone Store to run their greybox fault detection tool. They found that 1,138 of the sample of 3,000 apps had failures. Liu et al. [170] presented their DECAF system for detecting advertisement placement and layout violations, which can indicate advertisement fraud. They tested the system on 51,150 Windows apps for tablet or phone, and plan to extend it to detect more types of rule violation. The DECAF system was used by Microsoft Advertising in 2013 to prompt developers to comply with layout rules. Crussell et al. [65] presented MAdFraud, a system that detects advertisement fraud in the form of requesting ads while the application is in the background, and in the form of simulating user clicks on advertisements. They tested the system on 165,426 apps gathered from Google Play and a separate security company, and found that 30% of apps made advertisement requests while running in the background, and that 27 apps simulated user clicks on advertisements. Deng et al. [70] introduced their iRiS system, which performs static analysis on iOS apps in order to detect suspicious apps that may violate Apple’s terms of service. The authors detected 146 apps, from a sample of 2,019, that accessed sensitive user information through the use of private APIs.

2.10.2 Malware

This subsection concerns studies into Android malware. In 2010 Blasing¨ et al. [31] used the top 150 free Google Play apps to test their static and dynamic APK analyser. They tested these apps against 1 known malware app which was shown to be an outlier, establishing that their approach has the po- tential for malware detection. Peng et al. [221] proposed an app risk rating system trained on metadata from name, category and set of permissions. The system was trained on a set of 378 malware apps and evaluated on almost 500,000 apps mined from Google Play. Zhu et al. [316] proposed an approach to malware detection that 2.10. Security 58 uses permission and description information to detect abnormal permission sets. They evaluated the system on 5,685 apps mined from Google Play and found some words that have a large effect on permission validity; they also tested the system on known malware and found that it was able to successfully detect it as such.

Chakradeo et al. [46] created an app malware triaging tool called MAST, which they trained on known malware and a set of 14,888 apps mined from Google Play (that were assumed to be benign). Peiravian and Xingquan [220] trained a malware classifier using 1,250 samples of known malware and 1,250 samples of benign apps mined from Google Play. They trained the classifier using information on the permissions requested and the API calls made by the apps.

Sanz et al. [245] used cosine similarity between the sets of features declared in Android manifest files, in order to detect anomalies that might suggest the presence of malware when compared with a benign set. Sanz et al. later trained machine learning classifiers to distinguish between sets of known malware and 333 benign apps mined from Google Play [244, 243]. Similarly, Wang et al. [291] proposed DroidRisk, a tool trained on sets of known malware and assumed benign software mined from Google Play. DroidRisk rates the security risk of other apps in order to help prevent users from installing malware unknowingly. Apps mined from Google Play were assumed benign, as Google’s tool Google Bouncer [5] is run to detect malware and remove it from the store.
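The manifest-comparison idea can be sketched as follows: the declared feature or permission sets of two apps are encoded as binary vectors and compared with cosine similarity. The permission names and the flagging threshold are illustrative assumptions, not values from the original study.

```python
# Sketch of comparing declared feature/permission sets from two Android
# manifests using cosine similarity over binary indicator vectors.
# Permission names and the threshold are illustrative only.
import numpy as np

def cosine_similarity_sets(a: set, b: set) -> float:
    vocab = sorted(a | b)
    va = np.array([p in a for p in vocab], dtype=float)
    vb = np.array([p in b for p in vocab], dtype=float)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

benign_manifest  = {"INTERNET", "ACCESS_NETWORK_STATE", "VIBRATE"}
suspect_manifest = {"INTERNET", "READ_SMS", "SEND_SMS", "READ_CONTACTS"}

similarity = cosine_similarity_sets(benign_manifest, suspect_manifest)
print(f"similarity = {similarity:.2f}")
if similarity < 0.5:   # illustrative threshold
    print("low similarity to the benign profile: flag for inspection")
```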

As a means of detecting potentially malicious apps, Gorla et al. [95] performed topic modelling on app descriptions, and then applied K-means clustering to the results to form distinct clusters. Utilising API information from the app manifest, the authors trained a one-class support vector machine (SVM) [178] on each cluster to detect outliers in terms of API usage, which may suggest the presence of malware. This approach was later extended by Ma et al. [173], who used known malware and benignware to train their model, and reported improvements in the resultant precision, recall and F-measure.
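A minimal sketch of the outlier-detection step is given below, assuming that apps have already been grouped into a description-based cluster. The binary API-usage features, the data and the parameter choices are illustrative only, and are not those of the original studies.

```python
# Sketch of the outlier-detection step: within one description-based
# cluster, fit a one-class SVM on binary API-usage vectors and label
# apps whose API usage deviates from the cluster norm.
import numpy as np
from sklearn.svm import OneClassSVM

# Rows: apps in one cluster; columns: uses-API flags
# (e.g. location, contacts, SMS, camera).
api_usage = np.array([
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],   # unusual combination for this cluster
])

detector = OneClassSVM(kernel="rbf", nu=0.2, gamma="scale").fit(api_usage)
labels = detector.predict(api_usage)   # +1 = inlier, -1 = outlier
print(labels)
```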

Avdiienko et al. [14] extracted information flow data from apps in order to train a benign-trained malware classifier. The classifier was trained on 2,950 of the most popular Google Play apps, which are assumed benign as their download rank is in the top 100 in each of 30 categories. In this way, the authors combined non-technical information (download rank) with extracted technical information (information flow) to detect malware. The system reported high precision on sets of known malware from the Genome project [308] and the VirusShare database [279]. In a similar way, the 2013 study by Sanz et al. [243] trained machine learning classifiers to separate known malware from benign apps mined from Google Play. The 2014 study on identifying malicious apps using system call events, by Ham and Lee [106], also used apps from the category as a benign set against which to compare.

Lageman et al. [150] generated feature sets to be used for the classification of malware and benignware, from runtime log datasets of 419 malware apps and 417 mined benign apps. They tested the feature set and achieved a true positive rate of 90% with a Random Forest classifier [90]. In the largest app study to date, Chen et al. [48] ran their DiffCom system on 1,165,847 apps mined from Google Play and third party Android stores. DiffCom detects malware, including zero-day malware, without prior knowledge of malware, using a simple comparison with known apps in the corpus. The system was tested on a sample of 50,000 apps and achieved a false positive rate of 0.04 and a false negative rate of 0.06. When run on the entire dataset, DiffCom detected 127,429 instances of malware and 20 likely instances of zero-day malware.

2.10.3 Permissions

This subsection concerns studies into permissions misuse, and the potential security risks arising from the permissions requested by Android apps.

In 2013 Awang Abu Bakar and Mahmud [15] mined 5,000 apps from the Google Play store and analysed their permissions. They found statistically significant (though weak) Spearman’s rank correlation coefficients of 0.13, 0.24 and -0.13 between (technical) the number of permissions asked for and (non-technical) the price, download rank and rating, respectively. They highlight the top permissions requested by apps, and found that 40% of the apps requested the phone’s status and identity, a source of sensitive information. Stevens et al. [258] mined 10,300 apps from several Android stores including Google Play and applied the permissions analysis tool Stowaway [13], which can detect declared and used permissions. The authors found that 44% of apps in their sample contained at least one unnecessary permission, and computed a Spearman’s rank correlation coefficient of 0.72 between the number of search results for a permission on Stackoverflow and the number of times it is found to be misused. Book et al. [34] studied library permissions in 114,000 apps mined from the Google Play store, showing that bundling libraries with apps leads to old library versions being included. Advertisement libraries increasingly take advantage of app permissions, presenting a potential security risk, which the authors argue should be addressed by the app store or by privacy legislation.

Pandita et al. [216] presented the WHYPER system for automatically extracting, from the description, the reason a permission is used. They evaluated the system using 581 apps mined from Google Play, which were manually labelled by the authors. The authors tested the system on the address book, calendar and audio recording permissions, and achieved an average precision of 82.2% and recall of 81.5%. In a related study, Qu et al. [231] introduced AutoCog, a tool for checking the fidelity between app descriptions and requested permissions. The authors tested the system on 45,811 Google Play apps, and achieved a precision of 92.6% and a recall of 92.0% when detecting 11 permissions.
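The underlying idea of description–permission fidelity checking can be illustrated with the naive sketch below, which simply looks for permission-related keywords in the description. The permission-to-keyword mapping is an invented assumption, and both WHYPER and AutoCog use far richer natural language processing than this.

```python
# Naive illustration of checking description/permission fidelity: for each
# requested permission, look for related keywords in the description and
# flag permissions with no apparent justification. The keyword mapping is
# a toy assumption; WHYPER and AutoCog use far richer NLP than this.
PERMISSION_KEYWORDS = {
    "READ_CONTACTS": {"contact", "contacts", "address book", "friends"},
    "RECORD_AUDIO": {"record", "voice", "audio", "microphone"},
    "READ_CALENDAR": {"calendar", "event", "appointment", "schedule"},
}

def unjustified_permissions(description, requested):
    """Return the requested permissions with no related keyword in the description."""
    text = description.lower()
    return [p for p in requested
            if not any(k in text for k in PERMISSION_KEYWORDS.get(p, set()))]

desc = "Record voice memos and share them with your contacts."
print(unjustified_permissions(desc, ["RECORD_AUDIO", "READ_CONTACTS", "READ_CALENDAR"]))
# prints ['READ_CALENDAR']: requested but not reflected in the description
```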

The findings by Dering and McDaniel [71] suggest that library usage presents a security risk due to permissions usage. This is discussed in more detail in Section 2.6.

2.10.4 Plagiarism

This subsection concerns studies that detect plagiarism in Android apps.

In 2012 Potharaju et al. [226] conducted a study on 158,000 free Android apps, identifying apps that are likely to be plagiarised in order to spread malware. The authors found that the 29.4% of apps with the most permissive rights are the most likely to spread malware, and that non-technical information such as category, number of downloads and publishing day can increase the initial spread of the malware. Crussell et al. [63] introduced the tool DNADroid, which they used to identify 141 cloned apps in the Google Play store, from a mined set of 75,000 apps. The authors then introduced the tool AnDarwin, which decompiles apps and compares them to detect clones [64]. They detected 4,295 cloned apps using this approach from a mined set of 265,239 apps. This dataset is used in the study by Gibler et al. [87], who investigated the effects of application plagiarism on developers.

Zhu et al. [313, 315] mined ranking, rating and review data from 15,045 apps from the Apple App Store. They detected outliers using hypothesis tests in order to find potentially fraudulent apps. They took a unique approach to the issue with app ranks (only the top apps in Google Play, Windows Phone Store and Apple App Store have download ranks), in that they termed the period in which an app has a rank as a ‘leading event’ and consecutive events as a ‘leading session’. Several authors used API information to detect plagiarised apps [145, 288, 303], which are discussed in more detail in Section 2.6.

2.10.5 Privacy

This subsection concerns studies into privacy risks on the Android platform, and in Android apps.

In 2011 Batyuk et al. [21] used the top 1,865 free Google Play apps to test their static APK analyser, which detected that 167 apps accessed private identifiers, presenting a security risk. 114 of these apps wrote the information after reading it, which might indicate that the apps contain spyware. The work has since been extended into a static analysis tool called Androlyzer [66]. Chia et al. [52] evaluated the ratings of apps from Facebook, Chrome and Google Play, as a means of warning against privacy risks. They found a strong correlation between popularity and the number of ratings apps receive, but no correlations between permissions sought and privacy risk, nor rating. This result shows that ratings are not an effective indicator of the privacy of apps, and new suspicious apps are not likely to receive many ratings that could serve as potential warnings for future users.

Gibler et al. [86] mapped Android API calls to privacy information, and performed static analysis to identify apps where private data is leaked. Using their tool, AndroidLeaks, they analysed 24,350 apps from Google Play and third party stores, and found 2,342 apps with privacy leaks. Grace et al. [97] introduced AdRisk, a static analysis tool for identifying potential privacy risks associated with advertisement libraries. From their study on 100,000 apps mined from Google Play, the authors found that 52,067 apps use advertisement libraries, of which 31% use more than one. The authors remarked that the majority of 100 studied advertisement libraries were found to collect personal information.

Vigneri et al. [277] used a set of 5,000 apps mined from Google Play, on which they performed dynamic execution to determine network usage. They focused, in particular, on network activity to URLs which they claimed could present privacy or security risks, such as those associated with tracking, spyware or malware. Network activity was compared both within category and overall, in order to identify apps with suspiciously high activity. The authors noted that a high proportion of apps, even those with high ratings and download ranks, downloaded a large number of advertisements. Huang et al. [122] presented their SUPOR system, that detects privacy information entry fields as potential privacy or security risks using static analysis. They evaluated the system on 16,000 apps mined from Google Play, obtaining a precision of 0.973 and a recall of 0.973, with a false positive rate of 0.087. The cases of entry fields found include national ID, username, password, credit card and health data.

2.10.6 Vulnerability

This subsection concerns studies that analyse potential security vulnerabilities from a variety of sources.

Moller et al. [204] studied the update behaviour of users following recent updates, finding from a case study that approximately half of users did not update their app for at least a week after the update. The authors argue that this could lead to users continuing to run vulnerable software even after a fix is available.

In 2015, Zhang et al. [305] argued that the descriptions given to apps contain insufficient security information. The authors presented the DescribeMe system, which generates security-centric descriptions using static analysis. They performed a user study using Amazon's Mechanical Turk [7], on a set of 100 apps, asking whether the generated descriptions are readable and whether they can reduce the rate at which users download malware. The generated descriptions achieved a 4% lower readability rating than the original human-written descriptions, but decreased the malware download rate by 39%. Yang et al. [299] used 633 apps mined from Google Play as the benign set to test their tool for distinguishing between malicious and benign apps. They found that the intent of security accesses is more related to whether an app is malicious than the type of security-sensitive resources that it accesses.

Schütte et al. [248] tested their dynamic analysis tool ConDroid on the top 10,000 free Google Play apps and found 172 apps suffered from a logic bomb vulnerability, by selectively executing code sections that use vulnerable APIs. Mutchler et al. [203] took a snapshot of 1,172,610 Google Play apps. They found that 998,286 of these apps used WebView, indicating that the apps use an embedded WebView in some way. The authors searched for several known vulnerabilities and found that 28% of the studied apps had at least one of these vulnerabilities. As a result, the authors propose a set of API changes to mitigate such threats. In a similar study Bhoraskar et al. [30] mined 1,010 apps from Google Play and used static analysis and partial app rewriting to check for known security issues in third party components. They found that 13 of the 200 apps using the Facebook SDK were vulnerable to known attacks, and 175 of 220 children's apps potentially collected information, in violation of the US Children's Online Privacy Protection Act [56].

2.11 Store Ecosystem

This section discusses literature that focuses on a store's ecosystem, or the differences between stores. This literature is summarised in Table 2.9.

The scale of studies in this section ranges from 1 to 1,164,489, with a median of 848.

Store Ecosystem literature is broken down into the "Inter-store", "Intra-store", "Recommendation" and "Simulation" subsections.

2.11.1 Inter-store

This subsection concerns studies on differences between stores.

In 2011, Syer et al. [260] studied the different code practices between app stores, by selecting 3 pairs of feature-equivalent apps from Android and Blackberry. The authors analysed the source code, code dependencies and code churn of these apps, and found that Android apps are generally smaller but rely heavily on the platform. Conversely, Blackberry apps are larger and rely heavily on 3rd-party APIs. In order to reach the largest customer base developers need to cater for each platform, and so the authors remarked that it is therefore easier to develop for Blackberry and port to Android than vice versa. Syer et al. [262] later compared development practices between 15 Android apps and 5 traditional desktop and server applications.

Table 2.9: Chronological summary of ecosystem-related literature [186]
a: Apple App Store; b: Blackberry; g: Google Play or other Android stores; w: Windows Phone
*: over 500 simulated apps (final values not specified in the paper); **: 500,000 simulated apps; ***: 1,250,000 simulated apps

Authors [Ref], Year | Store | Venue | No. apps

Syer et al. [260], 2011 | b,g | SCAM | 3
d'Heureuse et al. [72], 2012 | a,b,g,w | MCCR | 1,164,489
Jung et al. [135], 2012 | a | Market Lett | 1,189
Garg and Telang [85], 2013 | a | MIS | 1,223
Lim and Bentley [157], 2012 | a | GECCO | **
Lim and Bentley [156], 2012 | a | ALIFE | **
Lim and Bentley [158], 2013 | a | CEC | **
Zhong & Michahelles [307], 2013 | g | SAC | 191,301
Petsas et al. [223], 2013 | g | IMC | 316,143
Syer et al. [262], 2013 | g | CASCON | 15
McDonnell et al. [190], 2013 | g | ICSM | 10
Cocco et al. [53], 2014 | a | MWIS | *
Wenxuan and Airu [294], 2014 | a,g,w | ICDMW | 736,377
Ng et al. [210], 2014 | g | COMPSAC | 506
Liu et al. [168], 2015 | g | WSDM | 6,157
Ruiz et al. [197], 2015 | g | IEEE Soft. | 10,150
Joorabchi et al. [134], 2015 | a,g | ISSRE | 14
Gómez et al. [91], 2015 | g | ICSE NIER | 1
Askalidis [12], 2015 | a | CoRR | 162
Xie and Zhu [297], 2015 | a | WiSec | 179,353
Corral and Fronza [57], 2015 | g | MOBILESoft | 100
Lim et al. [159], 2015 | a | TEVC | ***

Mean | | | 144,844
Median | | | 848

They found that mobile apps are most similar to Unix utilities, in terms of smaller code bases and small development teams. However, they also report that mobile apps suffer from greater numbers of defects and slower fix times than the studied traditional applications.

In 2012 d'Heureuse et al. [72] mined 1,164,489 total apps from Apple, Blackberry, Google and Windows app stores at regular intervals over a period of 3 months, in order to perform cross store comparison and also to study growth over time. The authors found that the smaller stores (Blackberry and Windows) had similar rates of growth to the larger stores (Apple and Google), at 2% in the two month time period studied. The smaller stores (Blackberry and Windows) were found to be the most expensive, and all stores displayed a similar power-law curve in price, with many cheap and free apps. Apps that appeared in multiple markets were on average 7.15 MB larger in the Apple store, and were a similar size in the 3 other stores.

Petsas et al. [223] analysed the downloads of 316,143 apps from 4 third-party Android app stores. They found that 10% of the apps accounted for at least 70% of total downloads in the stores, and that user downloads followed a clustering type behaviour, where subsequent app downloads were usually in the same category. The authors also found that popularity followed a power-law distribution against app price, for paid apps. Ng et al. [210] looked into the safety of third-party Android stores by downloading the top apps from Google Play and 20 other third-party Chinese Android app stores. They compared the APKs to check whether they are the same as the official releases, and ranked the severity level of differences. The authors concluded that the third party app stores studied cannot be trusted, as the proportion of apps which do not match their official releases is high, as are the corresponding difference severity levels.

In 2015 Ruiz et al. [197] conducted a longitudinal rating study on 10,150 apps over the period of 12 months. They argued that the Amazon style rating system, in which ratings are accumulated over the lifespan of an app, is too slow to adapt to changes in apps, whose performance is determined by their current release. The current Google Play rating system makes it more difficult for an app to increase its rating with a strong release than, for example, the Apple App Store rating system.

Joorabchi et al. [134] introduced CheckCAMP, a tool that checks for inconsistencies between Android and iOS versions of the same app. The authors tested the tool on 7 open source apps and 7 industry apps, and validated their results with a user study, finding an F-measure of 1.0 on the open source apps and an F-measure of 0.92 on the industry apps.

There are sources of non-technical information that replicate information found on app stores, but provide a more accessible means to gather the data. For example, the study by Syer et al. [261] uses information on the number of downloads from AppBrain, a replication of the number of installs bracket on Google Play (e.g. 1,000,000 - 5,000,000 installs appears on AppBrain as 1,000,000+). Ihm et al. [127] combined download information on 10 social networking apps from Google Play with the number of registered users on their respective websites. The authors found a strong correlation between the two metrics.

2.11.2 Intra-store

This subsection concerns studies into the inner workings of individual stores. This includes ranking behaviours, updates to apps and to the platform, source code quality and promotions.

Jung et al. [135] assessed the differences between free and paid apps on Apple's Korean App Store. They found that customer ratings were more critical to the survival of free apps, and there was also a benefit from being an early entrant into markets. In 2013 McDonnell et al. [190] studied 10 apps using source code from GitHub [89]. The Android platform was shown to be evolving fast, with an average of 115 API updates per month, due to which 28% of Android references were out of date, and the median lag time to update to support a new API was found to be 16 months. The APIs used most were the ones updated most frequently, yet interestingly API updates were more defect prone than other changes in client code.

Apps in Google Play do not have accessible information on their total number of downloads, other than 'ranges', such as the range 50-100. Zhong and Michahelles [307] analysed the distributions of download ranges and ratings from 191,301 Google Play apps. They found that a small number of very popular 'blockbuster' apps account for the majority of app downloads, and also have high ratings indicating customer satisfaction. Paid apps achieved more success when they were cheaper, but expensive professional apps had a disproportionately high number of downloads. The authors concluded that developers can break into the higher download ranking positions by fulfilling a niche market. Garg and Telang [85] compared paid app demand in the Apple App Store, using download ranks. They found that the top ranked paid app was downloaded 120-150 times more than the 200th ranked app.

Askalidis [12] studied the effects of sales promotions in the Apple App Store on 162 apps. They found that rival apps were able to benefit from a promotion, so long as their promoted price is cheaper than their competition. The authors also found that sales where apps become free, or have easily redeemable digital discounts, were the most successful. Sales were shown to have mixed effects on the ratings of apps. Gómez et al. [91] proposed an app store feature of automatically patching defective apps, which they demonstrated by automatically fixing a defective app mined from Google Play. Xie and Zhu [297] investigated the practice of promoting apps through buying positive reviews, via illicit "underground" services. The authors registered on 8 such app promotion sites and exposed approximately 30,000 promoted apps. Their tool, AppWatcher, was used to collect information from 179,353 randomly selected iOS apps, from which they mined 9,399,014 reviews. The authors reported on differences between datasets of promoted and random apps.

Corral and Fronza [57] studied 100 open source apps that are available on the Google Play store. They performed correlation and regression analyses between source code quality metrics and the store performance metrics: number of downloads, number of reviewers and average rating. The authors found no strong correlation and no strong regression coefficients, rejecting their initial hypothesis that source code quality plays a role in app success.

2.11.3 Recommendation

This subsection concerns studies into recommender systems using app store information.

In 2014 Wenxuan and Airu [294] used information on the number of downloads and numbers of reviews, as well as the number of apps downloaded by, and reviewed by, participating users. This data was used as part of a recommendation system called Interoperability-Enriched Recommendation (IER), which enables them to recommend similar apps to a user in the Windows Phone Store using data mined from 736,377 Google Play, Apple App Store, and Amazon App Store apps. Liu et al. [168] also studied app recommendation systems, by incorporating the level of privacy that the app needs as well as user interests. They evaluated their approach using 6,157 apps mined from Google Play, and found that their recommender performed better when treating each app function with different privacy allowances. They use the rating distribution over their dataset as the motivation for modelling user preference with a Poisson distribution.

2.11.4 Simulations

This subsection concerns simulations of app stores.

Lim and Bentley simulated the app store ecosystem using an agent-based evolutionary model, in order to experiment with different publicity strategies [156, 157]. The authors modelled apps with infectious properties, that could spread after being downloaded by a user. They found that an 'app epidemic' is most likely to occur when the app appears on the 'new apps' chart. The authors then used the model to explore different ranking algorithms [158]. They simulated users, and experimented with alternating time periods for updating the 'new apps' chart, and the degree to which historical performance factors into the 'top apps' chart. The study found that the top apps chart performs best in terms of overall downloads by incorporating fresh apps, and for this to work it needs to incorporate less historical performance data (also found later by Ruiz et al. [197]).

Lim et al. later simulated the ecosystem from a user's perspective [159], using collected usage information from over 10,000 participants [155]. They modelled developer strategies such as 'innovator' (who produces apps with random features) and 'copycat' (who copies the app) [159]. They found that 'optimiser' (who improves on the original 'innovator' apps) and 'copycat' working together led to the best overall fitness, provided they represented a low proportion of the overall modelled developer population. Cocco et al. [53] extended the model used by Lim and Bentley, and investigated additional ranking algorithms and user behaviour. They explored store ranking algorithms, and found that a 1% chance of a new app appearing in the top charts leads to the highest downloads-to-browse ratio.

Table 2.10: Chronological summary of size and effort prediction literature [186]

Authors [Ref], Year | Venue | No. apps

Sethumadhavan [252], 2011 | ISMA | 6
Preuss [228], 2012 | The IFPUG Guide to IT & Software Measurement | 1
Preuss [227], 2013 | ICEAA | 1
van Heeringen and van Gorp [272], 2014 | IWSM-MENSURA | 0
Abdullah et al. [1], 2014 | ICOS | 0
D'Avanzo et al. [68], 2015 | SAC | 8
Francese et al. [80], 2015 | SEAA | 23
Ferrucci et al. [77], 2015 | SEAA | 13
Ferrucci et al. [78], 2015 | PROFES | 13

Mean | | 7
Median | | 6

2.12 Size and Effort Prediction

Papers that predict size or effort based on the functionalities offered by an app are discussed in this section, and are summarised in Table 2.10. Many of the papers mine apps from Google Play, and compare the resultant predicted size with the actual size reported in the store and/or LOC (number of Lines Of Code) of the apps.

The scale of size and effort prediction studies is relatively small but, since the field has witnessed strong growth in 2015, it seems likely that the scale of studies will grow in the future.

In 2011 Sethumadhavan [252] discussed the application of Function Point Analysis (FPA) to Android applications, pointing out that compared with traditional desktop applications, mobile apps contain limited functionality, and often functionality is merely a wrapper to system functionality. Preuss [227, 228] then showed how FPA can be used for the estimation of the cost of a mobile app, using the approach on a case study Android application. In 2014 van Heeringen and van Gorp [272] discussed how to use the COSMIC method [58] to measure the functional size of apps. Abdullah et al. [1] discussed using the COSMIC method to estimate game apps, using an intermediate representation of required assets and functionality in the Unity3D game engine. In 2015 D'Avanzo et al. [68] applied the COSMIC approach to 8 Google Play apps, and applied linear regression to the functional point size in order to estimate the code size. By applying leave-one-out cross validation, the authors showed that the approach can accurately estimate code size based on functionality alone, once trained.

Francese et al. [80] used linear regression to estimate the development effort needed, and the number of GUI components, based on requirements alone. The authors found, from a study on 23 Android applications, that the estimates were accurate when trained on source code metrics such as classes, files and LOC. Ferrucci et al. [77] applied the COSMIC approach to 13 Android applications, showing that functional size is strongly correlated with app size, and that it can be used to accurately estimate the bytecode size of the app. Ferrucci et al. [78] later compared the related approaches by D'Avanzo et al. [68] and van Heeringen and van Gorp [272] on their dataset of 13 Android apps. They found that both functional size results were correlated with multiple app size measures, but that the approach presented by D'Avanzo et al. [68] was more accurate.

2.13 Checklist for Future App Store Analysis Authors

This survey has reported on the general content of studies, as well as the scale of apps used, and the store used. In future surveys it may be possible to synthesise more information from future literature. Such richer analysis, facilitated by richer data reporting, could lead to new insights and directions in the field of App Store Analysis. To help facilitate future studies such as SLRs, we present the following checklist as recommendations for data to include in future studies:

– App Stores used to gather collections of apps.
– Total number of apps used in the study.
– Breakdown of free / paid apps used in the study, including information regarding in-app-purchases where possible.
– Categories used, with breakdown of app counts in each category.
– Indication of whether API usage was extracted from the studied apps to facilitate the study.

– Indication of whether code was needed from apps to facilitate the study.
– Indication of whether open source apps were used exclusively for all or part of the study.
– Total number of reviews used, if any.
– Description of ratings and user feedback categories, including trends and response ranges.
– Details of statistical analysis techniques that were used in the study.

2.14 Threats to Validity

Internal validity: Our internal validity may be affected by missing relevant papers that constitute "app store analysis", or by using inclusion criteria that exclude relevant literature. To help mitigate this risk, we include influential and related literature that does not meet the inclusion criteria in Appendix A, and take this literature into account when making our conclusions.

External validity: Our external validity may be affected by the uniqueness of current mobile app store data. Conclusions drawn about the research presented in this chapter may not generalise well to other software systems, or to future app stores. To mitigate this risk, we make our definitions for app store, technical and non-technical attributes as general and inclusive as possible, in order that conclusions may extend beyond the current versions of mobile apps.

Construct validity: Our construct validity may be affected by our carrying out the search queries and snowballing we describe. It is possible that papers were missed through this procedure. To mitigate this risk, we were as systematic as possible; while not able to carry out a "systematic literature review" due to there not yet being a well defined body of "app store analysis" literature, we repeated search queries on as many literature repositories as possible, and performed snowballing extensively.

Conclusion validity: Our conclusion validity may be affected by our understanding of the work presented, which incorporates a wide range of software engineering disciplines, and by missing relevant work. We contacted the authors we cite in order to ensure we had not missed relevant work, or mis-quoted their research.

2.15 Conclusions

We have surveyed the published literature in App Store Analysis for software engineering, and identified the key sub-fields of App Store Analysis to date: "API analysis", "feature analysis", "release engineering", "review analysis", "security analysis", "store ecosystem comparison", and "size and effort prediction". Newer sub-fields such as "release engineering" and "size and effort prediction" have shown strong growth in 2015, suggesting that they might eventually overtake other smaller sub-fields such as "store ecosystem". The scale of app samples used in studies has increased: in 2015 the number of studies using between 10,000 and 100,000 apps was approximately three times that of 2014. We have observed the emergence of new areas of App Store Analysis, and the progression from conceptual ideas to practical empirical studies that apply and refine them.

Many aspects of app data have been used as features in empirical app store analysis, in the "feature analysis" subsection. One of the potential feature sources is app descriptions, yet research to date has looked at n-gram mining methods, and topic extraction has been used only for summarisation. We look at the potential solution of feature extraction using topic modelling in Chapter 4.

Through our own app store mining, we have noted that app store data availability is incomplete, and this is true for review data as well. This creates an inherent sampling bias when mining app and review data for analysis. The effect of this bias on results is explored in Chapter 5.

The literature has shown an increase in the field of release engineering, and has looked at the content of releases relating to ad updates. However, no literature to date has looked at the aspects of releases which might relate more directly to success, such as features. This problem is explored in Chapter 7.

In keeping with the checklist proposed in Section 2.13, we give details of the app stores used, as well as app and review quantity, and metrics used.

Chapter 3

Methodology

This chapter describes the methods, tools and data used throughout this thesis.

3.1 Statistical Analysis

Experiments are carried out with importance placed on achieving results that are statistically significant. Due to the large number of apps available for analysis, experimental results are not skewed by use of small datasets.

For correlation analysis, Spearman's rank correlation is used. This results in a rho value and a p-value. The rho value indicates the strength of the correlation, where 1.0 shows that the compared datasets are perfectly monotonically related (as set a increases, set b always increases), −1.0 shows that the compared datasets are perfectly negatively monotonically related (as set a increases, set b always decreases), and 0.0 shows that there is no monotonic correlation at all. The p-value indicates the chance of observing the rho value given that there is, in fact, no correlation. Interesting correlations therefore have high absolute rho values and low p-values; if either the rho value is close to zero, or the p-value is much higher than zero, then there is little evidence for any correlation.
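As an illustration of this correlation analysis, the sketch below uses SciPy's spearmanr; the metric values are placeholders rather than thesis data.

```python
# Minimal sketch of the Spearman's rank correlation analysis described above.
# Assumes two paired lists of per-app metrics (e.g. rating and download rank);
# the example values are illustrative, not thesis data.
from scipy.stats import spearmanr

ratings = [4.5, 3.0, 4.8, 2.5, 4.1]      # hypothetical per-app ratings
download_ranks = [12, 340, 8, 510, 45]   # hypothetical per-app download ranks

rho, p_value = spearmanr(ratings, download_ranks)
print(f"rho = {rho:.3f}, p = {p_value:.3f}")
# A strong negative rho would be expected here: better-rated apps tend to
# have lower (better) download ranks.
```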

We compare distributions using a two-tailed unpaired non-parametric Wilcoxon test [295], which tests against the Null hypothesis that the result sets are sampled from the same distribution. We also compare result sets using Vargha and Delaney's Â12 effect size comparison test [273], which results in a value between 0 and 1, that tells us the likelihood that one measure will yield a greater value than the other. The p-value is the probability that one would observe the difference in median values, given that there is, in fact, no difference in the two distributions from which the releases are drawn. It is a conditional probability, usually used to reject the Null hypothesis (that there is no difference).
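The distribution comparison can be sketched as follows. This is not the thesis tooling: SciPy's mannwhitneyu is substituted here as the unpaired two-tailed Wilcoxon rank-sum test, the Â12 value is derived from the U statistic, and the sample values are illustrative.

```python
# Sketch of the distribution comparison described above (assumption: scipy's
# mannwhitneyu stands in for the unpaired two-tailed Wilcoxon rank-sum test;
# the example data is illustrative, not thesis data).
from scipy.stats import mannwhitneyu

group_a = [3.2, 4.1, 4.5, 3.9, 4.8]   # e.g. metric values for one set of releases
group_b = [2.9, 3.1, 3.8, 2.7, 3.3]   # e.g. metric values for another set

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')

# Vargha and Delaney's A12: probability that a value drawn from group_a
# is larger than one drawn from group_b (0.5 means no difference).
a12 = u_stat / (len(group_a) * len(group_b))
print(f"U = {u_stat}, p = {p_value:.3f}, A12 = {a12:.3f}")
```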

3.2 Data

Data from app stores is readily available through their online web pages. In recent years there has been a transition from static to dynamic content, where data on web pages is loaded on demand and usually as a result of the user clicking a button. This development has made automated data mining more difficult, but is solved through the use of frameworks that load an in-memory browser, such as spynner [230]. These frameworks are able to load a dynamic page and proceed to click on buttons as a user would in order to load the data. There are still limitations on how much content can be dynamically loaded without introducing overheads on the system, i.e. available memory to store the in-memory web pages.

Highly successful stores such as Google Play and the Apple App Store make compelling arguments to be focal to future studies, as do emerging stores such as Windows Phone Store. To support the studies in this thesis, snapshots of apps were mined from Blackberry World App Store, Google Play and Windows Phone Store. Due to the time taken to mine Blackberry, and larger snapshots from Windows, these stores were mined fortnightly, whilst Google and smaller Windows snapshots were mined on a weekly basis. Table 3.1 shows details of the data mined up to August 1, 2016.

Data mining was carried out using a combination of a dynamic browsing framework to fetch dynamic content (spynner [230]), as was used in the collection script by Harman et al. [109], and a more traditional web page fetching library to fetch static content (urllib2 [229]). Data availability is limited, and so the amount of data collected varies per store. The app collection process is described in Section 3.3. Blackberry app reviews have also been mined as part of the study in Chapter 5 [183], and more detail on this process can be found in Section 3.3.1.

3.3 Mining Process

We use a process for mining apps that is based on the 4-step approach used by Harman et al. [109], shown in Figure 3.1:

Data Mining Phase 1 (Extraction of App List): We use the spynner [230] framework to continually load more apps to the 'most popular' list, but available memory becomes a limiting factor as the page size grows rapidly, when a large number of apps are available.

Table 3.1: App collection summary as of August 1, 2016

Store | Subgroup | Start | Schedule | Latest | Snapshots | Median Apps

Blackberry | Most popular | Feb'14 | Fortnightly | Oct'15 | 45 | 23,162
Google | Most popular overall | Feb'14 | Weekly | Aug'16 | 128 | 1,080
Google | Most popular in (sub)categories | Nov'14 | Weekly | Aug'16 | 83 | 17,079
Google | All previously collected | Feb'15 | Weekly | Aug'16 | 83 | 60,008
Windows | Most popular overall | Mar'14 | Weekly | Aug'16 | 123 | 1,974
Windows | Most popular in (sub)categories | Dec'14 | Fortnightly | Aug'16 | 43 | 22,783

On a machine with 4GB of RAM, the script is able to capture a list of between 20k and 24k apps before the framework runs out of memory. Due to its loading of dynamic content, spynner is susceptible to memory limitations and delays, particularly if the page it is loading is large. This, combined with the high app numbers present, limits our ability to mine all available apps from the Blackberry store. It is unknown how to get around this problem to effectively mine the entire store due to its size. In all cases there are limitations on what can be mined, which is related to what is contained on the pages, how pages are loaded, and available tools. Blackberry allows all apps to be loaded through a dynamic ordered list, through which we load up to 24,000 app links, and mine data from those pages. Google allows the top 540 paid and 540 free apps to be loaded, all of which are mined. Windows allows up to the top 1000 paid and 1000 free apps to be loaded, all of which are mined.

Data Mining Phase 2 (Raw App and Review Data Download): We visit each app in the list extracted by Phase 1, and download the HTML page which holds the app metadata. For Blackberry and Windows pages, the spynner framework is used, and for Google urllib [229] is used.

Data Mining Phase 3 (Parsing): We parse the raw data for a set of unique searchable signatures, each of which specifies an attribute of interest. These patterns enable us to capture app information such as price, rating, category and description, while other information such as the rank of downloads and the identifier were established in Phase 1.

[Figure 3.1: App and Review mining process showing the 4-phase approach described in Section 3.3. Phase 1 extracts the app list from the store, Phase 2 downloads raw app and review data, Phase 3 parses it into refined data, and Phase 4 stores it in the app database (app id, name, developer, rating, number of ratings, download rank, price, description, what's new, size, install range, version, collection date) and the review database (review id, app id, title, rating, body, date submitted).]

For example, the unique searchable signature for Blackberry rating is awwsProductDetailsContentItemRating.
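To make the parsing step concrete, the sketch below searches raw HTML for such a signature. Only the signature string comes from this thesis; the surrounding HTML structure and the helper function are assumptions for illustration.

```python
# Sketch of the Phase 3 parsing step: locate a unique searchable signature in
# the raw HTML and pull out the value next to it. The HTML structure assumed
# here is hypothetical; only the signature string itself appears in the thesis.
import re

def extract_by_signature(html, signature):
    # Assume the attribute value sits inside the first tag carrying the
    # signature, e.g. <div class="...Rating">4.5</div>
    pattern = re.compile(signature + r'[^>]*>\s*([^<]+)<', re.IGNORECASE)
    match = pattern.search(html)
    return match.group(1).strip() if match else None

sample_html = '<div class="awwsProductDetailsContentItemRating">4.5</div>'
print(extract_by_signature(sample_html, 'awwsProductDetailsContentItemRating'))  # 4.5
```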

Data Mining Phase 4 (Database): We store the parsed app data in an app database. The approach is applicable to any app store, with sufficient changes to the sections parsing the web pages. However, in other app stores, the initial list popularity information from Phase 1 would need to be adapted, as other stores do not provide a global popularity ranking, but instead provide separate free and paid lists, or lists per category.

Table 3.2: App review data summary

Store | Snapshot | Median reviews | Total apps | Total reviews

Blackberry | 140205 | 19 | 13,171 | 3,036,601
Blackberry | 140306 | 2,425 | 489 | 1,329,938
Blackberry | 140903 | 10 | 16,301 | 3,202,398

3.3.1 User Reviews

Review data can be loaded from Blackberry World App Store via dynamic lists. The complete dataset is available, and it is for this reason that it is used in the study in Chapter 5. An overview and discussion of the availability of reviews from other stores can also be found in Chapter 5.

Review data is mined from Blackberry by repeatedly requesting more reviews via spynner. Reviews are loaded in sets of up to 14, depending on their length, until either the memory is exceeded and the framework crashes, or the soft limit set in the mining code is reached. To prevent crashes, we set this limit to load 4,000 reviews per app, which consumes between 2GB and 3GB of system memory. An overview of mined review data from Blackberry World App Store is presented in Table 3.2.
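A minimal sketch of this collection loop is given below, assuming a hypothetical load_more_reviews() helper in place of the real spynner-driven page interaction.

```python
# Illustrative sketch of the review collection loop described above: reviews
# are requested in small batches until a soft limit is reached. The
# load_more_reviews() helper is hypothetical; the real implementation drives
# an in-memory browser (spynner) to request the store's next batch of reviews.
SOFT_LIMIT = 4000   # per-app cap used in the thesis to avoid exhausting memory

def collect_reviews(app_page, load_more_reviews):
    reviews = []
    while len(reviews) < SOFT_LIMIT:
        batch = load_more_reviews(app_page)   # returns up to ~14 reviews per request
        if not batch:                         # no further reviews available
            break
        reviews.extend(batch)
    return reviews[:SOFT_LIMIT]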

3.3.2 Persistent List Collection

Since February 2015, we have aggregated a list of all unique links captured from the Google Play store, for weekly collection of all valid apps. Figure 3.2 shows the number of total unique links, and valid working links, mined weekly between February 2015 and August 2016. The number of valid links mined is always lower than the total number of links, due to apps dropping out of the store. This may be due to a number of factors:

1) Google removes apps when they are flagged as inappropriate or infringe on copyright.

2) Authors remove apps for a number of reasons, not always clear. See for instance the case of Flappy Bird, an app that was reported to have been removed from the store by the author because it garnered too much media attention [23].

[Figure 3.2: Persistent Google data summary from February 2015 to August 2016: cumulative links gathered versus links successfully mined (in thousands of apps), per monthly snapshot.]

There are two noticeable dips in Figure 3.2, which were caused by changes in the app store page schema. Since these changes were unexpected, it took a week in each case to adapt the mining scripts to mine complete snapshots.
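The persistent collection described in this subsection can be sketched as follows; fetch_app_page() is a hypothetical stand-in for the real mining code.

```python
# Sketch of the persistent list collection: every newly observed app link is
# added to a cumulative set, and each weekly snapshot attempts to fetch every
# known link, recording how many are still valid.
def weekly_snapshot(known_links, newly_seen_links, fetch_app_page):
    known_links |= set(newly_seen_links)      # cumulative, never shrinks
    valid = 0
    for link in known_links:
        page = fetch_app_page(link)           # returns None if the app was removed
        if page is not None:
            valid += 1
    return len(known_links), valid            # total links vs successfully mined
```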

3.3.3 Time Series Datasets

The following datasets are used in the time series studies on app releases in Sections 7.3 and 7.4:

Strongly popular Google: Mined between July 2014 and July 2015, used by Section 7.3 and the published papers [182, 184]. This dataset consists of 307 apps.

Strongly popular Windows: Mined between July 2014 and July 2015, used by Section 7.3 and the published papers [182, 184]. This dataset consists of 726 apps.

Weakly popular Google: Mined between February 2015 and February 2016, used by Sections 7.4 and 7.5 and the published paper [185]. This dataset consists of 38,858 apps.

3.4 Metrics

The metrics used throughout this thesis are defined in Table 3.3. Metrics are extracted from mined app store data on a per-app basis. The success metrics can be used to monitor and compare how well apps are performing relative to other apps in their store or category.

3.5 Text Preprocessing

Throughout this thesis, the mined qualitative data, including app description text and review body text, is treated to the pre-processing steps shown in Figure 3.3. These steps are taken to reduce the impact of potentially unwanted (stop)words, and to combine multiple forms of the same word.

Table 3.3: App metrics summary

Metric | Description

(P)rice | The amount a user pays for the app in local currency (in this case GBP) in order to download it. This value does not take into account in-app-purchases and subscription fees, thus it is the 'up front' price of the app.
(L)ength of description | The length in characters of the description, after first processing the text as detailed in Section 3.5, and further removing whitespace in order to count text characters only.

Success Metric | Description

(R)ating | The average of user ratings made of the app since its release.
(D)ownload rank | Indicates the app's popularity. The specifics of the calculation of download rank vary between App Stores and are not released to the public, but one might reasonably assume the rank approximates popularity, at least on an ordinal scale. The rank is in descending order, that is it increases as the popularity of the app decreases: from a developer's perspective, the lower the download rank the better.
(N)umber of ratings | The total number of ratings that the app has received. A rating is the numerical component of a review, where a user selects e.g. a number from 1 to 5.
(NW) Number of ratings per week | The number of ratings that the app has received since the previous snapshot, which is taken a week earlier.

[Figure 3.3: Text pre-processing process discussed in Section 3.5: Raw Text → Phase 1 (Filtering) → Filtered Text → Phase 2 (Lemmatisation) → Processed Text.]

However, there is a risk that this can reduce the classification performance [174].

Text Preprocessing Phase 1 (Filtering): Text is filtered for punctuation, URLs, 1-letter words and the list of words in the English language stopwords set in the Python NLTK data package (nltk.corpus.stopwords.words('english')). It is then cast to lower case.

Text Preprocessing Phase 2 (Lemmatisation): Each word is transformed into its 'lemma form' using the Python NLTK WordNetLemmatizer. This process homogenises singular/plural, gerund endings and other non-germane grammatical details, yet words retain a readable form, e.g. faster → fast. The process is similar to stemming but produces more readable results, which helps with evaluation of the resultant topics.
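A minimal sketch of these two phases, using the NLTK components named above, is shown below; the exact URL and punctuation filtering rules are assumptions for illustration.

```python
# Sketch of the two pre-processing phases described above, using the NLTK
# components named in the text. The specific URL and punctuation handling is
# an assumption for illustration.
import re
import string
from nltk.corpus import stopwords        # requires the NLTK 'stopwords' corpus
from nltk.stem import WordNetLemmatizer  # requires the NLTK 'wordnet' corpus

STOPWORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    # Phase 1 (Filtering): drop URLs, punctuation, 1-letter words and
    # stopwords, and cast to lower case.
    text = re.sub(r'https?://\S+', ' ', text.lower())
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [t for t in text.split() if len(t) > 1 and t not in STOPWORDS]
    # Phase 2 (Lemmatisation): reduce each word to its lemma form.
    return [LEMMATIZER.lemmatize(t) for t in tokens]

print(preprocess("The best photo filters for your photos: https://example.com"))
```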

3.6 Topic Modelling

Topic modelling is used throughout the literature in app analysis for categorisation [124, 300], summarisation [50, 82], data extraction [83, 104], and API threat detection [95]. It has also been used recently in the MSR community for improvements in error reporting [40, 148], among many other applications. This section describes the topic modelling process used in the study, and defines terms used later on.

Topic modelling algorithms such as latent Dirichlet allocation (LDA) [32] take as input a set of unstructured textual documents, a fixed number of underlying topics, K, an α Dirichlet prior and a β Dirichlet prior distribution (in this case assumed symmetrical). We compute the topics using the Mallet framework [8], which performs parallelised Gibbs sampling [209], and delivers as output the distribution of word likelihoods for each topic, φ, and the distribution of topic likelihoods for each document, θ.


Table 3.4: Topic modelling parameter choices

Parameter | Setting

α | 50 (default)
β | 0.01 (default)
Gibbs sampling iterations | 500

φ and θ are probabilistic multinomial distributions sampled from the Dirichlet distribution, and each φ_k and θ_m sums to 1. In addition, LDA uses the bag of words model, which ignores the ordering and placement of text within a document, and treats each word as a separate random sample from a topic.

Topic Modelling Phase 1: The text from each document (app description or user review) is processed as described in Section 3.5. This text is fed into a modified version of Latent Dirichlet Allocation (LDA) [32], which enhances performance through parallelisation [209]. This is applied using Mallet [8], a Java tool for applying machine learning and language processing algorithms.

Topic Modelling Phase 2: The output of LDA includes a list of topic likelihoods for each document (app description or user review). From this, we compute a list for each topic, of the documents for which the topic-document likelihood exceeds a threshold.

Mallet [8] is run on the processed text with the parameters specified in Table 3.4. In the case of the Dirichlet priors, which affect the sparsity of the resulting per-topic word distributions and per-document topic distributions, we opt to go with the defaults as these generally work well in the context of natural language [217], and optimising the topic results is outside the scope of this study. The number of sampling iterations was selected as 500, since our results converged early and showed very little change (measured by incremental reductions in perplexity) after 100 iterations.

As with the number of topics, we trialled multiple topic threshold values ranging from 0.1 to 0.005, and selected 0.02. Using this setting, the resultant mean number of topics said to have generated an app came to a manageable figure of 5.62, and a manual inspection of 10 sampled apps showed an appropriate level of summarisation of the app description.
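For illustration, the sketch below reproduces the parameter choices of Table 3.4 and the Phase 2 thresholding using gensim's LdaModel in place of Mallet; the mapping of Mallet's default priors onto gensim's arguments is an assumption.

```python
# Sketch of the topic-modelling pipeline with the parameter choices from
# Table 3.4. The thesis uses Mallet's parallelised Gibbs sampler; gensim's
# LdaModel is substituted here purely for illustration.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

NUM_TOPICS = 100        # K, chosen after trialling 10 to 3,000 topics
ITERATIONS = 500        # sampling iterations used in the thesis
TOPIC_THRESHOLD = 0.02  # per-document topic likelihood threshold

def build_topics(preprocessed_docs):
    # preprocessed_docs: list of token lists produced by the Section 3.5 steps
    dictionary = Dictionary(preprocessed_docs)
    corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=NUM_TOPICS,
                   iterations=ITERATIONS,
                   alpha=50.0 / NUM_TOPICS,  # assumes Mallet's α=50 is spread over K topics
                   eta=0.01)                 # β Dirichlet prior
    # Phase 2: for each document, keep the topics whose likelihood exceeds the threshold.
    doc_topics = [
        [k for k, p in lda.get_document_topics(bow, minimum_probability=TOPIC_THRESHOLD)]
        for bow in corpus
    ]
    return lda, doc_topics
```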

3.7 TF.IDF

TF.IDF [179] finds 'top terms' in release text: each term in each document is given a score of TF (Term Frequency) multiplied by the IDF (Inverse Document Frequency). The IDF is equal to the log of the size of the corpus divided by the number of documents in which the word occurs.
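The definition above can be transcribed directly; the sketch below computes per-document TF.IDF scores and returns the top terms, using placeholder inputs.

```python
# Direct transcription of the TF.IDF definition above: a term's score in a
# document is its term frequency multiplied by log(corpus size / number of
# documents containing the term).
import math
from collections import Counter

def tfidf_top_terms(documents, top_n=5):
    # documents: list of token lists (e.g. pre-processed release texts)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    n_docs = len(documents)
    top_terms = []
    for doc in documents:
        tf = Counter(doc)
        scores = {term: count * math.log(n_docs / doc_freq[term])
                  for term, count in tf.items()}
        top_terms.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return top_terms
```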

Chapter 4

App Feature Extraction using Topic Models

App Stores provide a wealth of technical and non-technical information on app pages. This non-technical information may be used in conjunction with technical data to provide insights on software that were previously not possible. In particular, it is possible to extract claimed software features, a method that has been previously studied using an n-gram feature extraction approach [109]. This study explores the option of using topic modelling to extract topics, which can then be treated in a way analogous to features, and which offers greater flexibility due to the use of configurable settings such as the number of topics and topic probability thresholds. We perform empirical analysis on these topics using app ratings, download ranks and prices, and find that the approach produces very similar results to the feature based approach on the same dataset.

4.1 Introduction

Non-technical app store data, in the form of app descriptions, has been combined with technical information, in order to analyse empirical app store data. This was done by first extracting claimed software features using a greedy n-gram algorithm [79]. The greedy n-gram algorithm works by singling out the list of claimed features (as sentences) that are present in a Blackberry app description, and extracting bi- and tri-grams from them. These n-grams ("featurelets") are then clustered via a greedy algorithm, and the resultant clusters ("features") with more than one featurelet are used.

An alternative approach is to use topic modelling [32], an unrelated, but widely studied language processing technique. Using this technique we can extract topics, and apply the same experiments to them as we would to features. Topic modelling is described in more detail in Section 3.6.

We replicate the study by Harman et al. [109] using a topic modelling approach in place of their n-gram feature extraction method, showing that topics are analogous to features. Empirical analysis using these alternative methods yields the same results. Furthermore, we apply a sliding window approach based on the number of reviews an app has received, in order to identify and filter out the bias caused by zero-rated apps in the dataset, and extract more 'signal'.

4.2 Findings

The findings of this study are as follows:

i) Topics are analogous to features: we show that topics are analogous to features, and can be used to analyse claimed features mined from app descriptions, and other free-text content.

ii) Zero rated apps should be treated separately from other apps: we identify the bias that zero-rated apps add to observed trends, by performing correlation analysis after filtering for the number of reviews an app has, with a sliding window.

4.3 Definitions

Feature: a set of terms that describe something about the app, typically functionality or a property. Features are extracted from a developer's list of claims in their app's description.

Topic: a probabilistic distribution over a finite set of terms. Terms that co-occur in multiple documents in a corpus (and are therefore usually related) have high probabilities together in the same topic, so by applying a threshold probability value we can obtain a set of related terms, or a theme.

4.4 Data

The same data is used for this experiment as for the study by Harman et al. [109]. This dataset consists of 32,108 paid apps and 9,999 free apps, mined in September 2011.

4.5 Research Questions

This section explains the research questions explored in order to determine the viability of a topic-modelling based feature extraction approach.

RQ1: How much do correlations from the extracted topics differ from those of features?

We explore correlations between Price, Rating and Download rank (an inverse measure of popularity), and compare the results with those for the apps and features published previously [109]. We apply the Wilcoxon Rank-Sum test [295] to investigate the similarity of results between each of the feature, app and topic results. Specifically, we test against the Null hypothesis that the mean rank of a pair of sets (of P, R or D between features and topics) is the same, i.e. there is no difference between the two methods. Additionally, we qualitatively inspect a sample of 5 of the extracted topics to verify that they have meaning.

RQ2: What is the chance of producing a similar topic correlation in each category at random?

In the same way as with the feature study [109], the metric values for a topic are computed as the mean over the apps that share the topic. This will show whether 'pseudo topics', random sets of apps, can replicate the correlations found. To answer this question we construct pseudo topics by randomly sampling sets of topics for each app.
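A sketch of this pseudo-topic construction follows; the counts (6 pseudo topics per app, drawn from 100) follow the settings reported in the results, and the metric aggregation mirrors the per-topic means used for real topics.

```python
# Sketch of the pseudo-topic construction for RQ2: each app is assigned a
# random set of 'pseudo topics', and each pseudo topic's metric value is the
# mean over the apps that share it. The app_metrics values are placeholders.
import random
from collections import defaultdict

def pseudo_topic_metrics(app_metrics, n_pseudo_topics=100, topics_per_app=6):
    # app_metrics: dict mapping app id -> metric value (e.g. rating)
    members = defaultdict(list)
    for app_id, value in app_metrics.items():
        for topic in random.sample(range(n_pseudo_topics), topics_per_app):
            members[topic].append(value)
    return {topic: sum(vals) / len(vals) for topic, vals in members.items()}
```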

4.6 Application of LDA

LDA, and the process used to train the model, is described in detail in Section 3.6. We perform the following steps to obtain results:

1) LDA is trained on the set of app descriptions as described in Section 3.6; a mixture of topics generates each app description.

2) Metrics for topic price, rating and download rank are computed from the mean of the set of apps generated by a topic.

The settings specified in Table 3.4 were used. After trialling multiple values of K (number of topics) ranging from 10 to 3,000, we select 100 as a balance between representative topics and granularity, based on the size of the dataset: 10 topics were too general, whilst over 500 topics resulted in very specific topics, and an increasing quantity of low-quality topics [194].

Table 4.1: RQ1: Significance test results
P,R: Price / Rating correlation; P,D: Price / Download rank correlation; R,D: Rating / Download rank correlation

Wilcoxon Rank-Sum p-values | P,R | P,D | R,D

Features / Topics | 0.2796 | 0.3351 | 0.2304
Apps / Topics | 0.2931 | 0.2201 | 0.1802

4.7 Results

In the following results, P,R denotes Price/Rating correlation, P,D denotes Price/Downloads correlation and R,D denotes Rating/Downloads correlation.

RQ1: How much do correlations from the extracted topics differ from those of features?

The topic-correlation results can be found in Table 4.2. Topic-correlation results on 32,108 non-free apps from the Blackberry World App Store showed the same correlations as the feature-correlation method. These results show very similar correlations to those in the original study, which performed feature-based correlation analysis. This finding confirms the rather surprising finding that there is no correlation between either price and rating, or price and download rank of apps. As expected, there is a strong correlation between rating and download rank (popularity).

We qualitatively inspect 5 random topics in order to verify they have meaning. The top 5 terms from each of these topics are given below:

post package dhl ups tracking
theme icon scroll dock included
email send one use application
natural disaster earthquake tsunami always
image photo effect color filter

[Table 4.2: RQ1: Topic-correlation results. Spearman's rank correlation coefficients (P,R: Price / Rating; P,D: Price / Download rank; R,D: Rating / Download rank) for the topic, feature and app correlation methods, per category of the 32,108 non-free Blackberry apps, together with the number of topics showing p < 0.05.]

[Figure 4.1: RQ2: Comparison of Spearman's rank correlation coefficients, including box-and-whisker plots of pseudo topics, for Price/Rating, Price/Downloads and Rating/Downloads.]

By the qualitative judgement of the author of this thesis, these topics have meaning, and appear similar to the n-gram features stated in the earlier study [79]. Results of the Wilcoxon test are shown in Table 4.1. Each of the results comparing the populations of feature-R,D to topic-R,D shows a high p-value, indicating that the Null hypothesis that mean ranks are the same cannot be rejected.

RQ2: What is the chance of producing a similar topic correlation in each category at random?

We sampled 6 random 'pseudo topics' for each document out of 100 potential pseudo topics, as 5.62 is the mean number of 'top topics' above the threshold value in the results from RQ1. Results for P,R, P,D and R,D correlations compared between topics, features, apps and pseudo topics are shown in Figure 4.1. The box-and-whisker plots show the Spearman's rank correlation coefficients of pseudo topic distributions for each category, and app, topic and feature results are plotted as triangles, diamonds and circles, respectively. Plus symbols in the plot show outliers from the pseudo topic distributions.

These results show that pseudo topics often produce similar correlations to app level correlation analysis between price, rating and download rank. However, topic- and feature-based correlation analysis identifies stronger correlation results in almost all categories.

The topic-correlation method is potentially faster than the n-gram approach, as the topic training algorithm is heavily optimised due to its widespread use and improvements such as parallelisation. However, the artifacts produced by this method cannot be used in the same way as those produced by the feature-correlation method: the topics generated are influenced by a random element and are subject to change during the generative sampling process, so if more data is added, the topics will change; conversely, the identified features are trigrams which will not change with new data, and as a result their presence or lack thereof can be tracked and studied.

4.8 The Problem of Zero Rated Apps

The results in Table 4.2 show a decline in the strength of the Rating/Download rank correlation, as observed in the original study [109]. We explored the issue and found that a large number of zero rated apps are present in the dataset, as shown by Figures 4.2b and 4.3b.

[Figure 4.2: Sliding window correlation graphs for free apps: (a) Rating/Download rank correlation (rho and p) against minimum number of ratings; (b) total number of apps against minimum number of ratings.]

[Figure 4.3: Sliding window correlation graphs for paid apps: (a) Rating/Download rank correlation (rho and p) against minimum number of ratings; (b) total number of apps against minimum number of ratings.]

We investigate further, using a sliding window mechanic where we increment the minimum number of ratings as a filter, and compute the results each time. Results for individual categories can be found in Appendix B.

The results in Figures 4.2 and 4.3 show that there is a significant correlation between rating and download rank when all reviews are considered. However, as we begin to be more selective of apps, choosing those that have a minimum number of 1 review, the correlation drops out and is no longer significant (the 'dip' in Figures 4.2 and 4.3). Interestingly, as we raise the minimum number of reviews further, the correlation increases and almost reaches the peak value (observed at minimumN = 0) again. This 'flattened tick shape' can be seen on almost all rating/download rank correlation graphs. The exceptions are the categories 'Reference and eBooks' and 'Themes', which do not exhibit the same relationship due to the large number of apps containing content and not functionality, i.e. a book or a graphical theme, but no application. This pattern is consistent among free and paid app correlation graphs between rating and download rank. The same cannot be said for paid app graphs for price/rating and price/download rank correlations.

The causes of the 'flattened tick shape' are the large numbers of specific cases. For minimumN = 0, a large number of apps are considered that have no rating, leading to a strong correlation as apps with no rating have low download rank. In the case of minimumN = 1, there is more evidence on which to perform correlation analysis, however in many categories the large number of apps with 1 rating can negatively affect the correlation coefficient. For minimumN > 1, there is sufficient evidence to perform correlation analysis with confidence that the results are representative of 'normal' apps: ones that have been used and reviewed at least a handful of times.
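The sliding window analysis can be sketched as follows; the app tuples are hypothetical placeholders, and the correlation is recomputed for each minimum-ratings filter.

```python
# Sketch of the sliding-window analysis described above: repeatedly filter the
# dataset by a minimum number of ratings and recompute the Rating/Download
# rank correlation. 'apps' is a hypothetical list of
# (rating, download_rank, number_of_ratings) tuples.
from scipy.stats import spearmanr

def sliding_window_correlations(apps, max_min_ratings=9):
    results = []
    for min_n in range(max_min_ratings + 1):
        subset = [(r, d) for r, d, n in apps if n >= min_n]
        if len(subset) < 2:        # not enough apps left to correlate
            continue
        rho, p = spearmanr([r for r, _ in subset], [d for _, d in subset])
        results.append((min_n, len(subset), rho, p))
    return results
```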

The strong R,D correlation results show that apps which are downloaded more are generally considered to be better, in most cases. Conversely, the insignificant correlations between price/rating, and between price/download rank, tell us that in most cases with non-free apps, price does not bear a relation to the likelihood of users downloading apps, or to their satisfaction with apps after using them.

4.9 Threats to Validity

Internal validity: Our internal validity may be affected by the use of the Blackberry dataset. However, the generalised process of feature extraction using topic modelling extends to any free-text app descriptions, and the method is intended to be more generic than n-gram feature extraction.

External validity: Our external validity may be affected by the use of the most popular apps in the store at the time of mining, and the use of Blackberry data. Findings, therefore, may not extend to other stores or samples. However, the methods we use can be applied to other samples as future work.

Construct validity: Our construct validity may be affected by the replication of the same experiments that were used for n-gram features. Our finding that both methods achieve similar results does not necessarily mean that the methods extract similar features. We mitigate this threat by inspecting several topics and features to verify that similar features are obtained.

Conclusion validity: Our conclusion validity may be affected by our qualitative human assessment of the features and topics extracted. However, our methodology can be applied more widely to assess extracted features or topics from any dataset.

4.10 Conclusions

We have shown that topic modelling can be used successfully as an alternative to the feature n-gram detection approach used in the paper by Harman et al. [109]. This study has shed extra light on the observed trends within the Blackberry data, and led to a more in-depth study of the relationships present in the data once it is filtered according to number of reviews. The results gained from the topic modelling approach were not different in nature to those from the feature-based approach. Due to the configurable nature of topic models, they could be used where tuning for better representational performance is desired. This configurability could, however, lead to difficulties in interpretation: where topics are too general, it may be difficult to extract meaning from them. Through careful tuning, however, they can be as fine grained as n-gram features. Features remain a better option for applications such as tracking migration: while topics will change over time as the corpus of app descriptions changes, features will not. But topics can be better applied to very large corpora, enabling the user to configure the number of topics to a manageable amount. The topic modelling approach offers the ability to link developer descriptions with other textual data, such as user reviews; this technique is later used in Chapter 5. Future work might use the topic modelling approach as either an alternative to the feature-based approach, or as an additional method for validation. Other topic models or inference methods for training the model might also be considered, such as the weighted topic model [194] or externally weighted model [189].

Chapter 5

The App Sampling Problem for App Store Mining

This work was published in the MSR 2015 Proceedings [183]. The first author’s contribution to this paper was to formulate the idea, implement and execute the experiments, and collect and analyse the results; other authors of the paper contributed to research question formulation, result analysis and the narrative write-up.

Many papers on App Store Mining are susceptible to the App Sampling Problem, which exists when only a subset of apps are studied, resulting in potential sampling bias. We introduce the App Sampling Problem, and study its effects on sets of user review data. We investigate the effects of sampling bias, and techniques for its amelioration in App Store Mining and Analysis, where sampling bias is often unavoidable. We mine 106,891 requests from 2,729,103 user reviews and investigate the properties of apps and reviews from 3 different partitions: the sets with fully complete review data, partially complete review data, and no review data at all. We find that app metrics such as price, rating, and download rank are significantly different between the three completeness levels. We show that correlation analysis can find trends in the data that prevail across the partitions, offering one possible approach to App Store Analysis in the presence of sampling bias.

5.1 Introduction

Data availability from app stores is prone to change: between January and September 2014 we observed two changes to ranked app availability from the Windows store: a major change, increasing the difficulty of mining information by refusing automated HTTP requests; and a minor change, extending the number of apps available. Google and Apple stores enabled mining of a large number of ranked apps during 2012 [82, 119], but far fewer are currently available. At the time of mining in February 2014, we observed that the availability of app data was incomplete in Google, Windows and Apple, but not in Blackberry. This study is concerned with reviews: user-submitted app evaluations that combine an ordinal rating and text body. Table 5.1 shows the availability of reviews at the time of writing, as well as a summary of the quantity of data. The availability of reviews presents a challenge to researchers working on App Store Review Mining and Analysis, because it affects generalisability. If it is only possible to collect sets of the most recent data, i.e. from Google Play and the Windows Phone Store, then it is only possible to answer questions concerning this most recent data. For instance, the research question “What is the type of feedback most talked about in app reviews?” could not be answered using a data subset, but could be reasonably answered by collecting a random sample from all published reviews. Work in App Store Review Mining and Analysis has analysed app review content and sentiment [104, 119, 126, 138, 214], extracted requirements [83, 124] and devices used [139], and produced summarisation tools [50, 82]. It is apparent, however, that the literature on App Store Review Mining and Analysis, with the exception of three related studies by Hoon et al. [119, 120, 274] and one by Fu et al. [82], uses only a subset of all reviews submitted (at the times of collection). We therefore question whether the data provided is sufficient to draw reasonable conclusions. As a way of exploring the issue empirically, we assess the level of representation an app review subset can provide. For this, a full set of user reviews is needed, and so we study the Blackberry World App Store, where the full published history is available.

5.2 Contributions

The contributions of this study are as follows: i) We survey the literature in the field of App Store Review Mining and Analysis, and identify the prevalent problem of partial datasets. ii) We present empirical results that highlight the pitfalls of partial datasets and consequent sampling bias. iii) We mine and analyse a dataset of apps and user requests, using manually validated automatic extraction. iv) We illustrate how the trends identified by correlation analysis are unaffected by this sampling bias, in our dataset.

Table 5.1: Review availability in February 2014 [183]

Store        Apps        Most reviews   Review availability   Method of access
Apple        1,200,000   600,000        Last release          RSS feed
Blackberry   130,000     1,200,000      Full                  Web
Google       1,300,000   22,000,000     2,400                 Web
Windows      300,000     44,000         36                    Web

Figure 5.1: Venn diagram showing the datasets we define [183]. C set: all available apps at the time of mining; Pa set: apps with a proper subset of all submitted reviews; F set: apps with all submitted reviews; Z set: apps with no reviews.

5.3 Definitions

We define the following sets, shown in Figure 5.1, in order to describe different levels of data completeness. These sets refer to the data collected in a sample, at the time the sample is taken, and enable us to refer to their completeness.

Set unit: an app and its collected reviews, in the dataset in question.

(C) Complete: The set of apps and all their available reviews at the time of mining. This set does not need to contain all of the apps in the store, but for all the apps it contains, it must also contain all of their available reviews.

(Pa) Partial: The set of apps and their reviews for which the dataset contains only a proper subset of the reviews submitted to the App Store at the time of mining.

(F) Full: The set of apps and their reviews for which the dataset contains all of the reviews submitted to the app store at the time of mining.

(Z) Zero: The set of apps that have no reviews.

(A) All: All mined data in a snapshot: (A = Pa ∪ F ∪ Z).

If the mined dataset contains all available reviews for all mined apps then A = F = C. If the mined dataset contains half of the reviews for all mined apps then A = Pa ⊂ C and F = ∅. If the mined dataset contains all available reviews for some mined apps and only a subset of available reviews for other mined apps (such as the dataset used in this study) then A = Pa ∪ F ∪ Z and A ⊂ C.
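As a concrete illustration of these definitions, the sketch below partitions a mined snapshot into F, Pa and Z; the dictionary field names are hypothetical stand-ins for the mined attributes.

```python
# Minimal sketch of partitioning a mined snapshot into the F, Pa and Z sets defined above.
# Assumes each app record carries counts of collected and store-available reviews;
# the field names are hypothetical.
def partition_snapshot(apps):
    F, Pa, Z = [], [], []
    for app in apps:
        if app["available_reviews"] == 0:
            Z.append(app)                      # no reviews at all
        elif app["collected_reviews"] == app["available_reviews"]:
            F.append(app)                      # full review history collected
        else:
            Pa.append(app)                     # proper subset of reviews collected
    return F, Pa, Z

apps = [
    {"id": "a", "collected_reviews": 4000, "available_reviews": 12500},
    {"id": "b", "collected_reviews": 320, "available_reviews": 320},
    {"id": "c", "collected_reviews": 0, "available_reviews": 0},
]
F, Pa, Z = partition_snapshot(apps)
print(len(F), len(Pa), len(Z))  # A = Pa ∪ F ∪ Z
```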

5.4 The App Sampling Problem

There has been a surge of work recently in the field of App Store Mining and Analysis, much of which is focussed on app reviews. 9 out of 13 past studies identified on App Store Review Mining and Analysis use Pa sets of data, drawn from the most recent reviews of the most popular apps at the time of sampling. A summary of the work to date on App Review Mining and Analysis is presented in Table 5.2, and a further discussion of related App Store research is presented in Section 5.11.

Following from the data accessibility summarised in Table 5.1 for the four App Stores, sets are available under the following conditions at the time of writing:
Apple App Store: F set available for apps with one release, or where only the most recent release has been reviewed; Pa set available for apps with multiple reviewed releases.
Blackberry World App Store: F set available for all apps.
Google Play: Data available for a maximum of 480 reviews under each star rating via web page (2,400 total) per app; hence F set available for apps with fewer than 481 reviews under each star rating, Pa set available otherwise.
Windows Phone Store: Data available for a maximum of 36 reviews per app; hence F set available for apps with fewer than 37 reviews, Pa set available otherwise.

Historically, greater access appears to have been available. Further data access is also available through 3rd party data collections, subject to their individual pricing plans, terms and conditions; such collections are out of the scope of this study as they are not freely available, and have not been used in the literature.

Table 5.2: Summary of related work datasets [183]
a: Apple App Store  b: Blackberry  g: Google Play or other Android stores
Pa: Partially complete review set  F: Fully complete review set
A: combined set of Pa, F and Zero-rated apps

Authors [Ref], Year                Store   Apps      Reviews      Type
Hoon et al. [120], 2012            A       17,330    8,701,198    F
Vasa et al. [274], 2012            A       17,330    8,701,198    F
Hoon et al. [119], 2013            A       17,330    8,701,198    F
Iacob et al. [124], 2013           G       161       3,279        Pa
Galvis Carreño et al. [83], 2013   G       3         327          Pa
Khalid [138], 2013                 A       20        6,390        Pa
Fu et al. [82], 2013               G       171,493   13,286,706   F
Pagano et al. [214], 2013          A       1,100     1,100,000    Pa
Iacob et al. [126], 2013           G       161       3,279        Pa
Khalid [140], 2014                 A       20        6,390        Pa
Chen et al. [50], 2014             G       4         241,656      Pa
Guzman et al. [104], 2014          A,G     7         32,210       Pa
Khalid et al. [139], 2014          G       99        144,689      Pa
This study                         B       15,095    2,729,103    A

Availability of app store data is likely to become a pressing problem for research if we cannot find ways to mitigate the effects of “enforced” sampling bias: sampling in which sample size and content are put outside of experimental control. This may be for reasons unavoidable (app store data availability), pragmatic (memory or framework limitations on loading large dynamic web pages), or otherwise. We call this the App Sampling Problem.

App Sampling Problem: The problem of enforced sampling bias, as a result of incomplete dataset availability, or other mining limitations.

In this study we investigate the effects of the app sampling problem on an app and review dataset mined from the Blackberry app store.

5.5 Data

The data mining process described in Section 3.3 was used to mine app and review data. spynner was used to continually load more reviews from Blackberry pages for review mining. This has the same memory limitation as Phase 1 (Section 3.3), enabling us to capture between 4k and 4.5k reviews per app before the framework runs out of memory. We therefore limit the number of captured reviews to 4k per app, and it is this limitation that causes a Pa dataset: the set of apps for which we have mined only a subset of the available reviews. Spam is common in app reviews, and can occur for reasons of competition, i.e. to boost one’s own app, or to negate a competitor’s app. Some app stores have methods in place to help prevent spam, but these methods are not publicly available, and they can never be 100% effective. Jindal and Liu [130] found that online reviews that are identical to another review are almost certainly spam, and are also the largest portion of spam by type. We therefore filter the reviews for duplicates, including only one copy of each unique review. A unique review is defined by its rating, review body and author name. We do not use more sophisticated spam detection and filtering methods in this study because there is no prescribed standard, and spam detection is not the aim of this study. Errors in app pages, such as negative ratings, negative numbers of ratings and empty app IDs, led to the exclusion of a number of apps, since we judged that these anomalies would affect the results of our experiments.
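A minimal sketch of the duplicate filter described above, keeping one copy of each (rating, body, author) triple; the field names are illustrative.

```python
# Minimal sketch of the duplicate-review filter: a review is unique by its
# (rating, body, author) triple, and only one copy of each is kept.
def deduplicate_reviews(reviews):
    seen = set()
    unique = []
    for review in reviews:
        key = (review["rating"], review["body"], review["author"])
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique

reviews = [
    {"rating": 5, "body": "Great app", "author": "alice"},
    {"rating": 5, "body": "Great app", "author": "alice"},   # exact duplicate: dropped
    {"rating": 5, "body": "Great app", "author": "bob"},     # different author: kept
]
print(len(deduplicate_reviews(reviews)))  # 2
```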

Table 5.3: Blackberry dataset used [183]

Set   Apps     Reviews
Pa    5,422    1,034,151
F     6,919    1,694,952
Z     2,754    0
A     15,095   2,729,103

The final number of apps used in the study is therefore 15,095¹. This full set of apps is the A set. As detailed in Table 5.3, it is split into Pa, F and Z subsets according to the definitions in Section 5.3. In the following experiments, Pa, F and Z contain distinct sets of apps: Pa ≠ F ≠ Z.

5.6 Research Questions

In this section we discuss the research questions that are asked in order to explore the app sampling problem.

RQ1: How are trends in apps affected by varying dataset completeness?

We explore how app-level trends differ, based on the completeness of their review sets. Since review sets are limited to 4,000 reviews as discussed in Section 3.3, their completeness (and therefore which of the sets Pa, F and Z they appear in) in this case reflects how many reviews they have received at the time of mining. RQ1.1: Do the subsets Pa, F and Z differ? The mined dataset is separated into the subsets Pa, F and Z, as illustrated in Figure 5.1. We compare the datasets using the metrics (P)rice, (R)ating, (D)ownload rank, (L)ength of description and (N)umber of ratings with a 2-tailed, unpaired Wilcoxon test [295]. The Wilcoxon test is appropriate because the metrics we are investigating are ordinal data, and therefore a non-parametric statistical test is needed that makes few assumptions about the underlying data distribution. We make no assumption about which dataset median will be higher and so the test is 2-tailed; the test is unpaired because the datasets contain entirely different sets of apps. We test against the null hypothesis that the datasets in question are sampled from distributions with the same median, and a p-value of less than 0.05 indicates that there is sufficient evidence to reject this null hypothesis with at most a 5% chance of a Type I error. Because multiple tests are performed, we apply Benjamini-Hochberg correction [24] to ensure that we retain only a 5% chance of a Type I error. We investigate the effect size difference between the datasets using the Vargha-Delaney Â12 metric [273].

¹Data from this study is available at http://www0.cs.ucl.ac.uk/staff/W.Martin/projects/app_sampling_problem/.

Like the Wilcoxon test, the Vargha-Delaney Â12 test makes few assumptions, and is suited to ordinal data such as this. It is also intuitive: for a given app metric m (from P, R, D, L or N) and app sets (A, B), Â12(A, B) is an estimate of the probability that a randomly chosen value of m from A will be higher than a randomly chosen value of m from B. RQ1.2: Are there trends within subsets Pa, F and Z? We compare the trends within datasets Pa, F, Z and A using Spearman’s rank correlation [256]. This method is run on each pair of the metrics P, R, D, L and N to check for correlations in subsets of the data, which we can compare with the overall trends from the A set.
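The RQ1.1 machinery can be sketched as below: scipy’s mannwhitneyu serves as the unpaired two-tailed Wilcoxon (rank-sum) test, statsmodels supplies the Benjamini-Hochberg correction, and Â12 is implemented directly from its definition. The toy rating vectors are illustrative only.

```python
# Sketch of the statistical comparison: unpaired two-tailed rank test per metric,
# Benjamini-Hochberg correction over the family of tests, and the Vargha-Delaney A12
# effect size. mannwhitneyu is used as the unpaired Wilcoxon rank-sum test.
from itertools import product
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def a12(xs, ys):
    """Probability that a random value from xs exceeds one from ys (ties count half)."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    ties = sum(1 for x, y in product(xs, ys) if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

pa_ratings = [4.1, 3.9, 4.5, 4.0, 3.7]   # toy per-app ratings for each set
f_ratings = [3.2, 4.4, 3.8, 3.5, 4.1]
z_ratings = [0.0, 0.0, 0.0, 0.0, 0.0]

pairs = [("Pa-F", pa_ratings, f_ratings),
         ("Pa-Z", pa_ratings, z_ratings),
         ("F-Z", f_ratings, z_ratings)]
pvals = [mannwhitneyu(a, b, alternative="two-sided").pvalue for _, a, b in pairs]
rejected, corrected, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for (name, a, b), p, rej in zip(pairs, corrected, rejected):
    print(f"{name}: corrected p={p:.3f}, significant={rej}, A12={a12(a, b):.2f}")
```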

RQ2: What proportion of reviews in each dataset contains a request?

Once a set of app reviews has been obtained, it can be useful to extract user requests from them. For example, this is useful for development prioritisation, bug finding, inspiration, and requirements engineering more generally [50, 82, 83, 104, 124, 126, 138]. We explore the subset of reviews that contain a user request, as this is the most studied in the literature. RQ2.1: What proportion of user reviews does the Iacob & Harrison process identify? We use the process proposed by Iacob and Harrison [124] to identify requests in the set of mined reviews. Reviews are treated with the text preprocessing algorithm detailed in Section 3.5. The process detects requests on a per-sentence basis by matching key words in a set of linguistic rules; in this case the sentences are the entire review bodies, as we remove punctuation as part of the text pre-processing (detailed in Section 3.5). This does not affect the matching except to reduce the number of possible matches per review to 1 (but we care about whether reviews contain a request in this case, not how many requests they contain). RQ2.2: What is the Precision and Recall of the extraction process? We assess the Precision and Recall of the request extraction algorithm, and compare its performance against a random guessing approach. We sample 1,000 random reviews from each of 4 sets: Pa reviews, Pa requests, F reviews and F requests. The sample of 1,000 is representative at the 99% confidence level with a 4% confidence interval [62]. Under the criterion that “the user asks for (‘requests’) some action to be taken”, each sampled review is manually assessed as either a valid request or an invalid request.
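In the spirit of this rule-based detection, the sketch below matches a few keyword patterns against review bodies; these patterns are hypothetical stand-ins for demonstration, not the published Iacob & Harrison rule set [124].

```python
# Illustrative keyword-rule request detection; the patterns below are NOT the published
# linguistic rules, which should be taken from Iacob & Harrison [124].
import re

REQUEST_PATTERNS = [
    r"\bplease (add|include|fix)\b",
    r"\bwould be (nice|great) to\b",
    r"\bneeds? to (have|support)\b",
]

def contains_request(review_text):
    text = review_text.lower()
    return any(re.search(pattern, text) for pattern in REQUEST_PATTERNS)

print(contains_request("Would be nice to be able to rate the movies. Please include for next update."))  # True
print(contains_request("Amazing app, five stars"))  # False
```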

RQ2.3: How do valid and invalid requests differ? We compare the set of correctly identified requests, the true positives (TP), with the set of falsely identified requests, the false positives (FP), in order to establish whether manual verification is necessary before using the extracted request data.

Using topic modelling (discussed in Section 3.6) allows intersections of app descriptions and user requests that share content to be computed, through use of a threshold probability value. A topic’s ‘strong contribution’ to an app description or user request indicates that the topic falls above the topic likelihood threshold for that particular document. Experiments in this section use a topic likelihood threshold of t = 0.05, as is the standard in the literature.
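A minimal sketch of the ‘strong contribution’ test, assuming per-document topic distributions are already available from the trained model; the example distributions are made up.

```python
# A topic contributes strongly to a document when its likelihood in that document's
# topic distribution falls above the threshold t. Distributions here are illustrative
# stand-ins for trained LDA output.
def strong_topics(doc_topic_dist, t=0.05):
    return {topic for topic, likelihood in doc_topic_dist.items() if likelihood > t}

description_topics = {3: 0.41, 17: 0.09, 42: 0.02}   # app description
request_topics = {3: 0.12, 17: 0.03, 55: 0.30}       # user request

shared = strong_topics(description_topics) & strong_topics(request_topics)
print(shared)  # topics linking this description with this request, e.g. {3}
```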

Trained topics from each set, as well as derived topic metrics, are compared to establish how different the TP and FP sets are. The metrics P, R, D, L and N are used as before, except that at the topic level they are computed as the median over all apps to which a topic strongly contributes. Hence, we compare the metrics for each topic in the 2 sets, using the Wilcoxon test as well as the Â12 effect size comparison.

RQ3: How do trends in requests differ in the sets Pa, F and A?

RQ1 seeks to establish an experimental approach to handling partial data, which we now apply to an instance, specifically request analysis. We compare the trends in each group of requests using correlation analysis. For comparison, we define 3 metrics at the topic level, using the topics from RQ2.3:
Ta: App prevalence: the number of app descriptions to which the topic strongly contributes.
Tr: Request prevalence: the number of requests to which the topic strongly contributes.
Td: Download rank: the median download rank of the apps to which the topic strongly contributes.
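Given strong-contribution sets per topic, the three metrics can be computed as sketched below; the data and names are hypothetical.

```python
# Sketch of the topic metrics Ta, Tr and Td defined above, assuming the strong-contribution
# sets of each topic have already been computed; data and names are hypothetical.
from statistics import median

def topic_metrics(topic, apps_by_topic, requests_by_topic, download_rank):
    apps = apps_by_topic.get(topic, [])
    requests = requests_by_topic.get(topic, [])
    Ta = len(apps)                                                   # app prevalence
    Tr = len(requests)                                               # request prevalence
    Td = median(download_rank[a] for a in apps) if apps else None    # median download rank
    return Ta, Tr, Td

apps_by_topic = {3: ["gps_tracker", "run_logger"]}
requests_by_topic = {3: ["req_101", "req_205", "req_311"]}
download_rank = {"gps_tracker": 1200, "run_logger": 340}
print(topic_metrics(3, apps_by_topic, requests_by_topic, download_rank))  # (2, 3, 770.0)
```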

Table 5.4: Manual topic classification results [183]

Class    Description                                                                        Count
Valid    Defines one or more functional or non-functional properties of an app.               80
Review   Words highly coupled to review text such as [error, load, sometimes].                 8
App      Words highly coupled to app description text such as [please, review, feedback].      6
Spam     Highlights spam in app descriptions and reviews used for advertising.                  1
Other    Does not fit into another category but is not valid.                                   5
Total    All topics                                                                           100

For RQ3 we lower the topic likelihood threshold to t = 0.02 (0.05 was used in RQ2.3) to ensure a large enough set of topics for the statistical set comparison tests, the Wilcoxon test and the Vargha-Delaney Â12 measure. Using the set of topics from RQ2.3, we compare requests between the Pa, F and A datasets using both the topics themselves and the three topic metrics. Because RQ3 concerns request properties and trends, we exclude the Z set for this question, since it cannot contain requests.

RQ3.1: Do the request sets from Pa, F and A differ? To answer this question we run the same 2-tailed, unpaired Wilcoxon test and Â12 metric used in RQ1.1. We apply the statistical tests to the Ta, Tr and Td distributions between the Pa, F and A datasets. This shows the difference between the dataset distributions, and indicates how the effect size differs. We apply Benjamini-Hochberg [24] correction for multiple tests, to ensure that we retain only a 5% chance of a Type I error.

RQ3.2: Are there trends within the request sets of Pa, F and A? We compare the trends of requests within each set by applying correlation methods to each pair of the topic metrics app prevalence, request prevalence and download rank, defined above. We run Spearman’s correlation, as used in RQ1.2, on the topic metrics Ta, Tr and Td. This enables us to identify how requests are affected by varying dataset completeness, and to establish whether we can learn anything from incomplete datasets.

Table 5.5: RQ1.1: Vargha and Delaney’s Â12 effect size comparison results. Large effect sizes (≥ 0.8) are marked in bold [183]

Datasets   P       R       D       L       N
Pa - F     0.533   0.529   0.564   0.554   0.630
Pa - Z     0.514   0.965   0.834   0.718   1.000
F - Z      0.547   0.966   0.811   0.763   1.000

5.7 Topic Modelling

This study uses the topic modelling process discussed in detail in Section 3.6. Topic modelling with LDA leads to topics that reflect the source corpus, and so inevitably, if spam exists in any of the user requests or app descriptions, it will also appear in some topics. For this reason we manually verified the 100 trained topics and classified them as valid or invalid, with the criterion that “the top 5 terms should refer to some functional or non-functional app property or properties”. Classification results are presented in Table 5.4. An example valid topic includes the top terms {gps, speed, location, track, distance, ...}, and an example invalid topic includes the top terms {great, much, awesome, worth, well, ...}. Henceforth only the 80 valid topics are used for comparison. Topic settings are discussed in Section 3.6. We experimented with settings of K = 100, 200, 500 and 1000 topics, and found that the settings of K = 500 and K = 1000 led to over-specialised topics which reflected the descriptions of individual apps. The settings of K = 100 and K = 200 enabled more generalisation throughout the topics, and we selected K = 100 as we judged that it was the lowest setting without any app-specific topics.
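The training step can be sketched with gensim’s LDA implementation, used here as a stand-in for the pipeline of Section 3.6; the tiny corpus and K = 3 are purely illustrative, whereas the study trains K = 100 topics on full app descriptions.

```python
# Sketch of LDA topic training and top-term inspection using gensim as a stand-in for
# the modelling pipeline; corpus and K are illustrative only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["gps", "speed", "location", "track", "distance"],
    ["invoice", "expense", "report", "client", "billing"],
    ["photo", "filter", "camera", "share", "edit"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10, random_state=1)

# Inspect the top 5 terms per topic, mirroring the manual validation step above.
for k in range(lda.num_topics):
    print(k, [term for term, _ in lda.show_topic(k, topn=5)])
```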

5.8 Results

This section answers the research questions defined in Section 5.6.

RQ1: How are trends in apps affected by varying dataset completeness?

RQ1.1: Do the subsets Pa, F and Z differ? The Wilcoxon test results for the metrics P, R, D, L and N between the sets Pa, F and Z were all < 0.001, and remained below 0.05 after we applied Benjamini-Hochberg correction for multiple tests. This shows that there is a significant difference between the sets Pa, F and Z.

Table 5.6: RQ1.2: Spearman’s rank correlation results. Non-significant results (p > 0.05) are omitted; results in bold have strong correlation coefficients [183]
P: Price  R: Rating  D: Download rank  N: Number of ratings  L: Length of description

Set    PR      PD      RD      NP     NR     ND      LP     LR      LD      LN
Pa     0.16    0.12   -0.06    0.11   0.04   -0.64   0.23   0.09   -0.21    0.39
F      0.25    0.39   -0.03    0.12   0.06   -0.35   0.25   0.15   -0.06    0.32
Z     -0.06    -      -0.23    -      -      -       0.26  -0.20    -       -
A      0.20    0.22   -0.31    0.14   0.45   -0.57   0.27   0.27   -0.22    0.46

The Vargha-Delaney Â12 effect size comparison results in Table 5.5 show that the Pa and F sets have a small effect size difference: hence, Pa and F are unlikely to yield different results for the metrics tested.

This result is a contrast to the comparisons between sets Pa and Z, and F and Z. The Z set contains only apps that have not been reviewed and therefore have no rating, and hence have a large effect size difference with the N (number of ratings) metric. However, there is also a large effect size difference between the rating, download rank and description length between Z and the other sets. This shows that the Z set is very different from the other sets and indicates that it should be treated separately when analysing app store data.

All sets have a small effect size difference when comparing P (price), which might be expected because 90% of the apps in the set A are free.

RQ1.2: Are there trends within subsets Pa, F and Z? The results in Table 5.6 further underline the need to treat the Z set separately when performing app studies, as it exhibits almost none of the trends of the other sets. Indeed, almost all correlation results from the Z set were not statistically significant (marked with ‘-’ in the table). Conversely, most results from the other sets were statistically significant, and the following (marked in bold) were strong:

Inverse RD correlation This indicates that as the (R)ating of apps increases, the (D)ownload rank decreases (a desirable effect, since 0 is the top download rank with the most exposure). The RD correlation results were not particularly strong, and were present in the A and Z sets only. The correlation is caused by large numbers of non-rated apps in the Z dataset: zero rated apps have high download ranks and zero ratings, which skews Spearman’s rank test.

Positive NR correlation This indicates that as the (N)umber of ratings increases, the (R)ating of apps increases. The trend was present in the A set only, and non-existent in the Pa, F or Z sets. Similar to the RD correlation result, this result is caused by the combination of non-rated and rated apps.

Inverse ND correlation This indicates that as the (N)umber of ratings increases, the (D)ownload rank decreases, which means that the app is more popular.

Positive LN correlation This indicates that as the (L)ength of descriptions increases, the (N)umber of ratings for the app increases. This result was consistent across both the Pa and F datasets, and had an even stronger coefficient in the A set. Figure 5.2 shows density scatter plots for the A set, for the pairs RD, NR, ND and LN. A logarithmic scale is used where appropriate. Darker cells indicate greater app density than lighter cells. The plots indicate that the RD and NR correlations exist only because of a large number of non-rated, and therefore zero-rated, apps from the Z dataset. Results such as these emphasise the need to consider the Pa, F and Z datasets separately.

RQ1: How are trends in apps affected by varying dataset completeness? There is a significant difference between Pa, F and Z datasets, in some cases with a large effect size; the trends observed differ between datasets, especially when Z is included.

RQ2: What proportion of reviews in each dataset contains a request?

RQ2.1: What proportion of user reviews does the Iacob & Harrison process identify? Details of the extracted requests can be found in Table 5.7. The proportion of extracted requests is lower than that reported by the previous study of Iacob & Harrison [124], which can be attributed to two major differences in this study: 1) The Blackberry World App Store is used in this study instead of Google Play. It may be that user behaviour in reviews is different, manifesting in fewer requests. It may also be that users use different words or phrases when making requests, and so the request extraction algorithm’s recall is diminished.

Figure 5.2: RQ1.2: Scatter density plots for the (A)ll dataset discussed in Section 5.8 [183]. (a) (R)ating against log of rank of (D)ownloads; (b) log of (N)umber of ratings against (R)ating; (c) log of (N)umber of ratings against log of rank of (D)ownloads; (d) (L)ength of description against log of (N)umber of ratings. R: Rating  D: Download rank  N: Number of ratings  L: Length of description

2) Our dataset contains 19 times more reviews and 56 times more apps. It therefore seems likely, due to the ‘most popular first’ nature of app data availability, that the dataset used in this study contains proportionally more low-ranked apps. It is possible that users make fewer requests in their reviews of such apps.

Table 5.7: RQ2.1: Request datasets [183]
Pa: apps and their reviews where review sets are incomplete
F: apps and reviews where review sets are complete
Z: apps with no reviews
A: Pa ∪ F ∪ Z

Set   Apps     Reviews     Requests   Proportion
Pa    5,422    1,034,151   32,453     3.14%
F     6,919    1,694,952   74,438     4.39%
Z     2,754    0           0          0%
A     15,095   2,729,103   106,891    3.92%

RQ2.2: What is the Precision and Recall of the extraction process? A sample of 1,000 random reviews was taken from each of 4 sets, and each review was manually classified as containing a request for action to be taken or not. The sets used were the reviews and the requests from each of the (Pa)rtially complete dataset and the (F)ully complete dataset. The manual assessment of request and review samples was completed by the author of this thesis in two days. Results from the assessment can be found in Table 5.8, and the computed Precision, Recall and F-Measure are found in Table 5.9. The resulting Precision is low, yet Recall is very high; the Iacob & Harrison algorithm [124] performs over 11 times better than random guessing. An example TP (true positive) request and FP (false positive) request can be read below.
TP request: “Would be nice to be able to access your account and rate the movies. Please include for next update.”
FP request: “Amazing app every thing u need to know about real madrid is in this app and it deserve more than five stars”
A general observation from the sampling process was that the reviews that were identified as requests were not short, and always contained ‘request-like’ words. However, there was a large proportion of (FP) reviews which contained words like ‘need’, but did not ask for any action to be taken by the developers. Handling such cases is a challenge for content analysis of user reviews, such as the Iacob & Harrison algorithm [124]; more research is needed in order to deal with these cases.

Table 5.8: RQ2.2: Manual assessment of reviews [183]
Pa: apps and their reviews where review sets are incomplete
F: apps and reviews where review sets are complete

Set   Type      Population   Sample   TP    FP    Precision
Pa    Request   32,453       1000     279   721   0.279
Pa    Review    1,034,151    1000     14    986   0.014
F     Request   74,438       1000     347   653   0.347
F     Review    1,694,952    1000     29    971   0.029

Table 5.9: RQ2.2: Assessment of request algorithm [183]
Pa: apps and their reviews where review sets are incomplete
F: apps and reviews where review sets are complete
TP: request identified was true positive
FP: request identified was false positive
FN: request not identified (false negative)

Set   TP    FP    FN   Precision   Recall   F-Measure
Pa    279   721   10   0.279       0.965    0.433
F     347   653   9    0.347       0.975    0.512
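The scores in Table 5.9 follow directly from the TP/FP/FN counts; the snippet below recomputes them as a sanity check.

```python
# Recompute precision, recall and F-measure from the TP/FP/FN counts in Table 5.9.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

for name, tp, fp, fn in [("Pa", 279, 721, 10), ("F", 347, 653, 9)]:
    p, r, f = prf(tp, fp, fn)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```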

The set of FN (false negative) reviews that were not identified as requests, but were manually assessed to contain a request, asked for work to be done in different ways. Linguistic rules could be added to encompass the FN cases, but this would run the risk of further lowering the Precision. An example FN request is as follows: FN request: “Sharp, but layout can be worked on!”

RQ2.3: How do valid and invalid requests differ? Results comparing topics from the different request groups are presented in Table 5.10. The content of valid requests is broad, using 31 of the 80 topics, and the set of 2,000 identified requests used 56 of the 80 topics. The two sets of topics do not differ significantly, as shown on the right-hand side of Table 5.10, where we use the Wilcoxon test to compare topic metrics between the two sets.

The observation that the topical content of identified requests is indistinguishable between the sets of TP and FP requests shows that they do not differ greatly, and therefore FP requests have little impact on the topical content, but may affect its distribution. The high Recall result shows that it is possible to reduce the set of all reviews to a much smaller set that is more likely to contain requests, and this process falsely excludes very few requests.

Table 5.10: RQ2.3: Comparison of TP and FP request topics [183]
P: Price  R: Rating  D: Download rank  N: Number of ratings  L: Length of description
TP: request identified was true positive  FP: request identified was false positive

Set               Topics        Metric   Wilcoxon p   Median difference
TP                31            P        0.36         0.00
FP                47            R        0.39         0.00
Intersection      22            D        0.14         1322
Total             56            N        0.32         3
Total Available   80            L        0.39         4

We suggest that, without a proven better-performing method for data reduction of user reviews, it is sufficient to use the set of all requests identified by the Iacob & Harrison algorithm [124] for analysis.

RQ2: What proportion of reviews in each dataset contains a request? Less than 3.92% of A, 3.14% of Pa and 4.39% of F reviews are requests in the Blackberry dataset, and we found the algorithm to have a recall > 0.96 and precision < 0.350. Manual inspection identified FN requests using different phrasing and keywords than those identified by the Iacob & Harrison algorithm.

RQ3: How do trends in requests differ in the sets Pa, F and A?

RQ3.1: Do the request sets from Pa, F and A differ? The Wilcoxon test results for the metrics Ta (topic app prevalence), Tr (topic request prevalence) and Td (topic median download rank) between the sets Pa, F and A show that there is a significant difference between the sets. The Â12 results in Table 5.11 also show that there is a large effect size difference for Ta, Tr and Td between the three sets. It can be seen from the results for RQ1.1 in Table 5.5 that there is a significant distribution difference, but a small effect size difference, between the description lengths in Pa and F. This means that Pa and F are unlikely to yield different description lengths. It can seem surprising, therefore, that there is a large effect size difference between the app prevalence of topics in the Pa and F sets.

Table 5.11: RQ3.1: Comparison results between request sets [183]
Pa: apps and their reviews where review sets are incomplete
F: apps and reviews where review sets are complete
Z: apps with no reviews
A: Pa ∪ F ∪ Z
Ta: number of apps a topic strongly contributes to
Tr: number of requests a topic strongly contributes to
Td: median download rank of apps a topic strongly contributes to

Datasets   Ta      Tr      Td
Pa - F     0.794   0.974   0.999
A - F      0.831   0.839   0.874
A - Pa     0.914   0.998   0.971

Recall that app prevalence for a topic means the number of app descriptions to which the topic strongly contributes. This result is explained by the size difference of the sets: it stands to reason that in a set with more apps, a topic would be featured in more app descriptions than in the smaller set. The effect can be seen in greater magnitude with the larger request prevalence effect size difference. The Pa set has a large effect size difference in Tr from both the A set and the F set. These results again show that topics are used more frequently in the larger of the sets, an expected result.

The results for Td (topic download rank) are more surprising. Table 5.11 shows that the distributions of topic download rank in the three sets are very different, and have a large effect size difference. Bearing in mind that the A set is the combination of apps from Pa, F and Z, the A - Pa result suggests that the Pa set’s median download rank is different from the overall median, while the F set’s median is closer. However, the Pa - F results are surprising as they mean that for the same topics, one set leads to apps with greater median download ranks than the other; yet the results for RQ1.1 in Table 5.5 showed a small effect size difference in download rank for Pa - F.

One possible explanation is that the topic download ranks are exaggerating the difference in distribution medians. Because of the way the topic model is trained, each app description must have at least one assigned topic; likely more than one, as the Dirichlet prior α is set to 50 [32]. Therefore, the entire set of app descriptions must be represented by only the 100 topics (80 of which are used here, as explained in Section 3.6). It is not hard to imagine that the median download rank of apps in these 80 topics could be very different for Pa, F and A, even if the distributions of apps have small effect size differences in D.

Table 5.12: RQ3.2: Request Spearman’s rank correlation coefficient results [183]
Pa: apps and their reviews where review sets are incomplete
F: apps and reviews where review sets are complete
Z: apps with no reviews
A: Pa ∪ F ∪ Z
Ta: number of apps a topic strongly contributes to
Tr: number of requests a topic strongly contributes to
Td: median download rank of apps a topic strongly contributes to
-: no correlation significant at the p < 0.05 level

Set   Ta - Tr   Ta - Td   Tr - Td
Pa    -0.341    -0.230    -
F     -0.596    -         -
A     -0.554    -         -

RQ3.2: Are there trends within the request sets of Pa, F and A? Figure 5.3 shows that the F set has a higher median app prevalence and request prevalence for topics. That is, topics contribute to a greater number of apps and requests, on average, in the F set than in the Pa set. The results presented in Table 5.12 show that there is a strong negative correlation between the prevalence of topics in apps and in requests. This is evidence that users tend to request things to a greater extent when they are present (or absent) in a smaller proportion of apps. The graphs in Figure 5.3 show that the scatter plot of Ta-Tr from the F set resembles that of the Pa set, but appears more stretched out and higher due to the greater number of apps and requests in the F set. One might expect there to be a strong correlation between popularity and request prevalence, indicating that users request things present in more popular apps (or conversely, a negative correlation indicating that users request more from low rated apps that may have fewer features), but this is not the case, as there was no significant correlation between the two metrics. A (speculative) interpretation of this result is that while users do not discriminate in terms of popularity, they do value diversity, hence the results of the negative prevalence correlation. An important finding from these results is that the correlation between app and request prevalence exists not only in the Pa and F datasets, but is actually stronger overall in the A dataset.

Figure 5.3: RQ3.2: Scatter plots for Ta against Tr [183]. (a) (Pa)rtial dataset; (b) (F)ull dataset. Pa: apps and their reviews where review sets are incomplete; F: apps and reviews where review sets are complete; Ta: number of apps a topic strongly contributes to; Tr: number of requests a topic strongly contributes to.

This demonstrates that although the Pa dataset is incomplete and has different properties, the trends are consistent, showing that it is still possible to learn things from the data.

RQ3: How do trends in requests differ in the sets Pa, F and A? Trends in requests appear more robust to app sampling bias than trends in apps; we found a strong inverse correlation between topic prevalence in apps and requests, which is consistent across the Pa, F and A datasets.

5.9 Actionable Findings for Future App Store Analysis

In this study, we have presented empirical evidence that indicates that the partial nature of data available on App Stores can pose an important threat to the validity of findings of App Store Analysis. We show that inferential statistical tests yield different results when conducted on samples from partial datasets compared to samples from full datasets; even if one were to exhaustively study the entire partial dataset available, one would be studying a potentially biased sample. The findings reported in this chapter suggest that this will be a potent threat to validity. Naturally, where researchers have full datasets available, they should sample (randomly) from these or base their results on the exhaustive study of the entire dataset. However, this raises the uncomfortable question: “What should we do when only a partial dataset is available?”. This question is uncomfortable because partial datasets are so prevalent in App Store Analysis, as the survey in this study indicates. Clearly, researchers need to augment their research findings with an argument to convince the reader that any enforced sampling bias is unlikely to affect their research conclusions and findings. Or, where possible, researchers can provide an assessment of the effect of the bias on their findings. This may be possible in some cases, since the sampling bias is often at least known, and sometimes quantifiable. For example, there may simply be a bias on either recency or popularity. Alternatively, researchers may choose to constrain study scope, making claims only about populations for which the sample is unbiased (e.g., all recent or popular apps on the store(s) studied). Ideally, one would like to see further studies of other App Stores to confirm these results. Unfortunately, many App Stores do not make full information available, as the Blackberry World App Store did at the time we studied it. Nevertheless, where full information is available, further replication studies are desirable: correlation analysis of App Stores provides a potential tool to understand the relationship between the technical software engineering properties of apps, and users’ perceptions of these apps. Understanding such relationships is one of the motivations for the excitement and rapid uptake of App Store Analysis.

5.10 Threats to Validity

This study is primarily concerned with addressing threats to validity, yet it may also have some of its own. Internal validity: Our internal validity could be affected by issues with the data used in our experiments, which was gathered from the Blackberry World App Store. We therefore rely on the maintainers of the store for the reliability and availability of raw data. Due to the large-scale and automated nature of data collection, there may be some degree of inaccuracy and imprecision in the data. However, we have made general observations on large sets of data, rather than detailed observations from individual apps, which provides some robustness to the presence of inaccuracies in the data. We also make the assumption, when processing the (N)umber of ratings for an app, that reviews are not removed from the store. This would cause an app that has been reviewed to appear non-rated. The mined review data is parsed in the same way as app data, to capture each review’s rating, body and author attributes; although the author information is only used to ensure that no duplicates are recorded, as a means of reducing spam. Were an app with many reviews to have some removed, it is unlikely that this would impact the overall findings, as the scale of the number of ratings provides some robustness to small changes. To the best of our knowledge, however, reviews are neither removed nor changed. External validity: With regard to external threats, we return once again to the dataset. We mined a large collection of app and review data from the Blackberry World App Store, but we cannot claim that these results generalise to other stores such as those owned by Google or Apple. Rather, this study discusses the issues that can arise from using biased subsets of reviews, such as the kind available from App Stores. Construct validity: Our construct validity is affected by our selection of experiments to measure the effect of the app sampling problem. We chose to compare empirical app metric distributions and trends, as well as topic trends, but this selection has surely missed other potential experiments which may be affected by the app sampling problem. We hope that the experiments chosen enlighten the reader as to the potential pitfalls of the app sampling problem, in order that they are able to avoid or mitigate its effects. Conclusion validity: Our conclusion validity could be affected by the human assessment of topics and requests in answer to RQ2. However, our definitions and methodology can be applied more widely to assess the effect of the app sampling problem on any dataset.

5.11 Related Work

This section outlines related work in the field of app review analysis. A detailed survey of app store analysis literature can be found in Chapter 2. User reviews are a rich source of information concerning features users want to see, as well as bug and issue reports. They can serve as a communication channel between users and developers. Several studies have utilised (F)ully complete user review datasets, and these analyse many more user reviews than those using (Pa)rtially complete user review datasets: a mean of 9.8 million reviews are used by studies on F datasets, which starkly contrasts with a mean of 0.2 million reviews used by studies on Pa datasets. A summary of recent work on App Review Mining and Analysis can be found in Table 5.2.

In 2012 Hoon et al. [120] and Vasa et al. [274] collected an F dataset containing 8.7 million reviews from the Apple App Store and analysed the reviews and vocabulary used. Hoon et al. [119] then further analysed the reviews in 2013, finding that the majority of mobile app reviews are short in length, and that rating and category influence the length of reviews. Another F sample was used in the 2013 study by Fu et al. [82], which analysed over 13 million Google Play reviews for summarisation.

Studies on Pa datasets use smaller sets of reviews, yet have produced useful and actionable findings. In 2013 Iacob and Harrison [124] presented an automated system for extracting and analysing app reviews in order to identify feature requests. The system is particularly useful because it offers a simple and intuitive approach to identifying requests. Iacob et al. [126] then studied how the price and rating of an app influence the type and amount of user feedback it receives through reviews. Khalid [138, 140] used a small sample of reviews in order to identify the main user complaint types in iOS apps. Pagano and Maalej [214] gathered a sample of 1.1 million reviews from the Apple App Store in order to provide an empirical summary of user reviewing behaviour, and Khalid et al. [139] studied the devices used to submit app reviews, in order to determine the optimal devices for testing.

Several authors have incorporated sentiment in their study of reviews. Galvis Carreño and Winbladh [83] extracted user requirements from comments using the ASUM model [133], a sentiment-aware topic model. In 2014 Chen et al. [50] produced a system for extracting the most informative reviews, placing weight on negative sentiment reviews. Guzman and Maalej [104] studied user sentiments towards app features from a small multi-store sample, which also distinguished differences of user sentiments in Google Play from the Apple App Store.

Research on app reviews is recent at the time of writing, but much work has been done on online reviews [111, 121, 130, 131], all built upon by the papers mentioned. Morstatter et al. [202] carried out a similar study to this on Twitter data, finding that the subset of tweets available through the Twitter Streaming API is variable in its representation of the full set, available via the Twitter Firehose dataset.

5.12 Conclusions

Sampling bias presents a problem for research generalisability, and has the potential to affect results. We partition app and review data into three sets of varying completeness. The results of a Wilcoxon test between observable metrics such as price, rating and download rank between these partitions show that the sets differ significantly. We show that by appropriate data reduction of user reviews to a subset of user requests, we can learn important results through correlation analysis. For example, we find a strong inverse correlation between the prevalence of topics in apps and user requests. We build on the methods used by Iacob & Harrison [124] to extract requests from app reviews, in addition to using topic modelling to identify prevalent themes in apps and requests as a basis for analysis.

Chapter 6

Causal Impact Analysis for App Stores

Rapid release strategies can offer significant benefits to both developers and end users [144], but high code churn in releases can correlate with decreased ratings [100]. Releases occur for a number of reasons [2], such as updating advertisement libraries [101], or stimulating user downloads and ratings [55], in addition to more traditional bug fixes, feature additions and improvements. Developers may even find certain days of the week can stimulate user activity more than others [67, 113], although this may not be related to user preference [205]. McIlroy et al. [191] studied update frequencies in Google Play, finding that 14% of their studied apps were updated in a two-week period. Nayebi et al. [205] found that half of surveyed developers had a clear release strategy, and experienced developers believed it affects user feedback.

In this chapter, we study time-series information about apps, and identify how release frequency can affect an app’s performance as measured by rating, popularity and number of user reviews. We introduce the method of Causal Impact Analysis, and describe our tool, CIRA, that performs this analysis for app store data. Using this tool, we identify the set of ‘impactful’ releases, that statistical evidence suggests caused a significant change in one of their app’s performance metrics, compared with a baseline set of non-releasing apps. We then analyse the characteristics of these highly impactful app releases, in order to understand the properties under developer control which could lead to releases impacting performance.

6.1 Introduction

Causal inference is a method used to determine the causal significance of an event or events, by evaluating the post-event changes using observational data. A traditional approach to causal inference is differences-in-differences analysis [28], which compares the differences between the prior and posterior vectors of a group that receives a given intervention against a group that does not receive this intervention. Conversely, the Causal Impact Analysis method [37], based on state-space models, treats each vector separately, in each case using the full control set to provide global variance. In the context of app stores, these methods have the potential to assess the causal significance of releases upon an app’s success in the store, in terms of user rating behaviour or download rank. Specifically, the prior and posterior vectors are observations of a metric such as rating, number of ratings or download rank, recorded at regular time intervals (i.e. weekly). The given interventions are new app version releases, as denoted by a change in the app’s version identifier in the store. The group that does not receive the intervention (release) is the set of apps that do not release new versions in the studied time period. Multiple apps typically do not receive the same intervention (release) at the same time, which makes differences-in-differences analysis difficult to apply in this case. Causal Impact Analysis, on the other hand, treats each app vector that receives a release separately, thereby making it better suited to this analysis of app releases. This motivates the use of Causal Impact Analysis. In this chapter we describe Causal Impact Analysis, and our tool, CIRA, that implements it.

Release: We determine a ‘release’ to have occurred if and only if the version identifier changes.

6.1.1 Process

Causal impact analysis trains a Bayesian structural time-series model [116, 249] on the data vector for each target release, using a set of unaffected data vectors known as the control set (defined in Section 7.4.2). This enables the model to make a prediction of the data vector in the posterior time period, accounting for local and global variations. Each metric (R, N, NW) and each release requires an individual experiment, as the method works each time on a single data vector.

Figure 6.1: Causal Impact Analysis workflow (1. Train local variance; 2. Train spike coefficients; 3. Train slab coefficients; 4. Local-only prediction; 5. Seasonal variance; 6. Counterfactual prediction)

Figure 6.2: Causal impact analysis graph for an invoice management application in Google Play [185] (y-axis: Rating; x-axis: Week). Shaded vertical bar: target release. Solid line plotted throughout: observed vector. Shaded region: confidence interval of 0.95. Solid line inside shaded region: counterfactual prediction.

Figure 6.1 shows the overall Causal Impact Analysis workflow:

1. Train the local trend parameters using the deviation of the observed vector in the prior time period. This is used to sample changes and compute the confidence interval.

2. Compute the ‘spike’ [249] for the observed vector from the set of controls, and assign coefficients for them, in order to make an accurate prediction.

3. Use the rest of the control set as the ‘slab’ [249], assigning equal coefficients for them, in order to account for global variations in the dataset.

4. Make a set of predictions using the local trend trained earlier, sampling for changes and noise.

5. Compute seasonal variance (optional). This is not used as app store ratings do not demonstrate cyclic behaviour over the maximum studied time (52 weeks).

6. Combine local-only prediction with control changes multiplied by their (static or dynamic) coefficients. This is the counterfactual prediction [37].

We then compute the cumulative pointwise difference between the observed vector and the prediction, normalised to the length of time at which each deviation occurs, and compute the cumulative probability from the local trend parameters.

A low value (p ≤ 0.01) indicates that the success metric changed significantly; see for example Figure 6.2, which shows that the rating of an invoice management application deviates significantly from the predicted vector. By setting the threshold to 0.01, there is a 0.01 probability of claiming a significant change where one does not exist, and we therefore expect a roughly 1% false positive rate.

We use CIRA to identify the set of releases for which there is evidence of a statistically significant effect on the metrics we collect. We refer to such releases as impactful releases in this thesis.

6.2 Causal Impact Release Analysis Tool (CIRA)

This section describes the tool, CIRA, that we have implemented in order to facilitate further study on app store data using causal impact analysis.

6.2.1 Design

CIRA is written in Java version 1.8 [213], making full use of lambda expressions for increased efficiency and potential for runtime parallelisation. Maven version 2 [268] is used to control the build, allowing for modularisation and testing. Unit tests are implemented in JUnit [136], and the following additional libraries are used: apache.commons.math3 [267] is used for statistics and regression; guava [93] is used for general utility functions; JFreeChart [211] is used for plotting graphs, as is JMathPlot [301], which was used for an earlier version of the prediction graphs. The module structure used is shown in Figure 6.4. Three main modules are used for command line functionality: cira, graph and runner. The runner module serves to run the command line tool that provides batch computation for any of the different modules that are implemented, as well as graph drawing on an individual basis. Supplementary modules were implemented that provide easier usability, for demonstration purposes. The tool module provides an executable jar binary that can be run cross-platform on desktop operating systems (provided they have a compatible version of Java installed). Similarly, the server module provides a servlet that runs on a Tomcat [266] server, providing access from any web browser.

Figure 6.3: CIRA class diagram. Dotted line: inheritance; solid line: subclass.

Figure 6.4: CIRA module diagram. Dotted line: dependency.

The CIRA class diagram for the three main modules is shown in Figure 6.3, and a module diagram is shown in Figure 6.4.

6.2.2 Program Flow

Figure 6.5 shows the workflow of CIRA, which is the algorithm as described in Section 6.1.1, with the exclusion of the seasonal variance computation (originally step 5). The seasonal variance is excluded from CIRA since, as noted in Section 7.3, the app store data studied does not exhibit signs of cyclic variance. CIRA takes as input two files: controls and targets. These files must be in the same format as those for CausalImpact. The format for CIRA input is specified in the usage manual in Appendix D. The local trend is computed by evaluating the standard deviation in the prior time period, $\sigma$, and incorporating two variance modulators based on this, $v_v$ and $v_d$. These modulators serve to add general variance, and delta variance, respectively. For simplicity they are assumed to have the same distribution:

$$v_v \sim \mathcal{N}(0, \sigma^2) \qquad v_d \sim \mathcal{N}(0, \sigma^2) \tag{6.1}$$

The counterfactual local trend is evaluated by sampling each of $v_v$ and $v_d$ for each counterfactual value, and adding their modulations to the last counterfactual value (starting with the final observed value in the prior time period):

$$\mathit{counterfactual}_i = \mathit{counterfactual}_{i-1} + v_v + v_d \quad \text{for } 1 \le i \le |\mathit{posterior}| \tag{6.2}$$

Figure 6.5: CIRA workflow (steps: 1. train local variance; 2. train spike coefficients; 3. train slab coefficients; 4. local-only prediction; 5. counterfactual prediction), explained in Section 6.2.2.

where:

$$\mathit{counterfactual}_0 = \mathit{prior}_{|\mathit{prior}|} \tag{6.3}$$
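The local-trend sampling of Equations 6.1 to 6.3 can be sketched as follows, assuming the prior standard deviation $\sigma$ has already been computed; the class and method names are illustrative and this is not CIRA's exact code.

```java
import java.util.Random;

/** Sketch of the local-trend counterfactual sampling of Equations 6.1-6.3. */
public class LocalTrendSampler {

    /** Samples one counterfactual trajectory, seeded with the final observed
     *  value of the prior time period (Equation 6.3). */
    static double[] sampleCounterfactual(double lastPriorValue, double sigma,
                                         int posteriorLength, Random rng) {
        double[] counterfactual = new double[posteriorLength + 1];
        counterfactual[0] = lastPriorValue;                       // Equation 6.3
        for (int i = 1; i <= posteriorLength; i++) {
            double vv = rng.nextGaussian() * sigma;               // v_v ~ N(0, sigma^2)
            double vd = rng.nextGaussian() * sigma;               // v_d ~ N(0, sigma^2)
            counterfactual[i] = counterfactual[i - 1] + vv + vd;  // Equation 6.2
        }
        return counterfactual;
    }

    public static void main(String[] args) {
        double[] trajectory = sampleCounterfactual(4.3, 0.05, 12, new Random(1L));
        for (double value : trajectory) {
            System.out.printf("%.3f ", value);
        }
        System.out.println();
    }
}
```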

The global trend is computed by assigning spike coefficients to a predetermined number of controls (the default is 3), and slab coefficients to the remaining controls. CIRA assigns spike coefficients using regression: partitions are assigned, of size equal to the prior time length minus 1, and regression is performed using each partition. Half of the partition is eliminated (the half with the lowest coefficients), and this process is repeated until the desired number of spike regressors is reached. In this way, the controls that most accurately predicted the observed vector are selected. The coefficients themselves are chosen by performing regression with the chosen vectors, and then multiplying the coefficients by a spike weight, $w_{\mathit{spike}}$, divided by the number of spike coefficients. Spike regressors could be assigned using alternative methods, but good agreement (with CausalImpact) was achieved using this method. Slab coefficients are assigned as:

$$c_{\mathit{slab}} = \frac{w_{\mathit{slab}}}{|\mathit{slab}|} \tag{6.4}$$

where $c_{\mathit{slab}}$ is the coefficient for a slab control, $w_{\mathit{slab}}$ is the weight assigned to the slab portion of coefficients, and $|\mathit{slab}|$ denotes the number of slab controls.

The default spike and slab weights used in CIRA are:

$$w_{\mathit{slab}} = 0.5 \qquad w_{\mathit{spike}} = 0.5 \tag{6.5}$$

Different values can be used, but we found good performance giving equal weight to spike and slab coefficients.
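As an illustration of how the coefficients of Equations 6.4 and 6.5 might be assigned, the sketch below ranks controls by a per-control univariate regression fit against the observed prior vector — a deliberate simplification of CIRA's iterative multivariate elimination — then scales the best-fitting controls by the spike weight and gives the remaining controls the uniform slab coefficient. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;
import org.apache.commons.math3.stat.regression.SimpleRegression;

/** Sketch of spike/slab coefficient assignment (Equations 6.4 and 6.5),
 *  using univariate fits to rank controls instead of CIRA's iterative
 *  multivariate elimination. */
public class SpikeSlabCoefficients {

    static double[] assign(double[] observedPrior, double[][] controlPriors,
                           int spikeCount, double wSpike, double wSlab) {
        int controls = controlPriors.length;
        double[] fit = new double[controls];     // R^2 of each control against the target
        double[] slope = new double[controls];   // univariate regression coefficient
        for (int c = 0; c < controls; c++) {
            SimpleRegression regression = new SimpleRegression();
            for (int week = 0; week < observedPrior.length; week++) {
                regression.addData(controlPriors[c][week], observedPrior[week]);
            }
            fit[c] = regression.getRSquare();
            slope[c] = regression.getSlope();
        }
        // The best-fitting controls form the spike partition; the rest are the slab.
        int[] spike = IntStream.range(0, controls).boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> -fit[i]))
                .limit(spikeCount)
                .mapToInt(Integer::intValue)
                .toArray();
        double[] coefficients = new double[controls];
        Arrays.fill(coefficients, wSlab / (controls - spikeCount));   // Equation 6.4
        for (int index : spike) {
            coefficients[index] = slope[index] * wSpike / spikeCount; // spike scaling
        }
        return coefficients;
    }

    public static void main(String[] args) {
        double[] observed = {4.1, 4.2, 4.2, 4.3, 4.4};
        double[][] controls = {
                {3.0, 3.1, 3.2, 3.2, 3.3},
                {4.5, 4.4, 4.4, 4.3, 4.3},
                {2.0, 2.2, 2.1, 2.3, 2.4},
                {4.0, 4.1, 4.1, 4.2, 4.3}
        };
        System.out.println(Arrays.toString(assign(observed, controls, 1, 0.5, 0.5)));
    }
}
```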

6.2.3 Agreement with CausalImpact

When the number of control vectors is increased, the execution time of CausalImpact increases exponentially, whereas CIRA is affected in order O(n) (in terms of additional regression runs), where n is the number of control vectors, and the complexity order of each regression run is unaffected. This is an advantage of CIRA, and a specialisation for app store data, where the number of control vectors available may be in the thousands. In contrast, the use case of web site ad campaign data is more likely to be in the tens. In order to assess the accuracy of CIRA, we compute the agreement:

$$\mathit{agreement} = \frac{YY + NN}{YY + NN + YN + NY} \tag{6.6}$$

and Cohen's Kappa [54], which takes into account the chance to randomly agree, and is bounded above by agreement. We used the dataset that will be used in Section 7.3, consisting of 442 targets and 100 controls, and manually assessed each target graph to indicate 'human agreement' with the impacts. This, of course, represents a threat to validity, concerning the human's understanding and assessment of causal analysis, but it helps to provide context for the agreement assessment. Table 6.1 shows the results of the agreement test between human, CausalImpact and CIRA. The results in Table 6.1 show that CausalImpact and CIRA show very similar agreement and Cohen's Kappa with the human assessment, of approximately 75% agreement and 0.35 Cohen's Kappa. This indicates that the approaches both agree with the manual assessment 75% of the time, of which about half of the agreement may be random. CausalImpact and CIRA are shown to have 85% agreement with each other, of which again about half of the agreement may be random. The approaches exhibit greater agreement with each other than with the manual assessment, indicating that they are providing similar results.

Table 6.1: Causal impact analysis agreement
Ag: agreement  CK: Cohen's Kappa

                 CausalImpact        CIRA
                 Ag      CK          Ag      CK
Human            0.74    0.36        0.76    0.34
CIRA             0.85    0.45        -       -

Figure 6.6: Real-world prediction issue: flatlining despite an apparent prior trend. Shaded vertical bar: target release; solid line plotted throughout: observed vector; shaded region: confidence interval; dotted line inside the shaded region: counter-factual prediction.

The runtime of CIRA was approximately 428 times faster for this dataset, with a median runtime of 1.916 seconds over 5 runs, compared with a median of 820.340 seconds for CausalImpact.
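For reference, the agreement of Equation 6.6 and Cohen's Kappa can be computed from two vectors of impactful / non-impactful classifications as in the sketch below; the class and method names are illustrative rather than taken from CIRA.

```java
/** Sketch: agreement (Equation 6.6) and Cohen's Kappa for two binary classifiers. */
public class AgreementStats {

    /** (YY + NN) / (YY + NN + YN + NY): the fraction of targets on which both agree. */
    static double agreement(boolean[] a, boolean[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) same++;
        }
        return (double) same / a.length;
    }

    /** Cohen's Kappa: observed agreement corrected for chance agreement. */
    static double cohensKappa(boolean[] a, boolean[] b) {
        double observed = agreement(a, b);
        double aYes = fractionTrue(a);
        double bYes = fractionTrue(b);
        double chance = aYes * bYes + (1 - aYes) * (1 - bYes);  // both yes, or both no
        return (observed - chance) / (1 - chance);
    }

    private static double fractionTrue(boolean[] classifications) {
        int yes = 0;
        for (boolean c : classifications) {
            if (c) yes++;
        }
        return (double) yes / classifications.length;
    }

    public static void main(String[] args) {
        boolean[] cira         = {true, false, true, true, false, false};
        boolean[] causalImpact = {true, false, false, true, false, true};
        System.out.printf("agreement = %.2f, kappa = %.2f%n",
                agreement(cira, causalImpact), cohensKappa(cira, causalImpact));
    }
}
```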

6.2.4 Web Access

CIRA is available for use via the integrated Server module, at http://www0.cs.ucl.ac.uk/staff/W.Martin/cira.

6.3 Extension 1 – Ensemble Classifier

The default mode in CIRA (known internally as SpikeSlab) relies on the control set spike partition to help predict the counter-factual trend. However, sometimes this prediction falls short of what a human, using their own judgement, would call obvious. This is noticeable when an observed vector is increasing at a steady rate in the prior time period, yet the counterfactual prediction plot flat-lines. This effect can be seen in Figure 6.6 and Figure 6.7a.

Figure 6.7: Comparison of CIRA baseline methods for correcting flatlining: (a) local-only, (b) moving average, (c) linear regression, (d) polynomial regression. Shaded vertical bar: target release; solid line plotted throughout: observed vector; shaded region: confidence interval; dotted line inside the shaded region: counter-factual prediction.

To combat this issue, we have implemented several alternative means to 'direct' the counterfactual prediction (a sketch of two of these adjustments follows the list):

Moving average: The moving average over the last 3 weeks of the observed vector in the prior time period is computed, and the counterfactual prediction is adjusted prior to the local trend variance samples, in order to 'continue' the trend. This technique can be seen in Figure 6.7b.

Linear regression: Regression is trained on the prior time period, using an arbitrary increasing predictor variable equal to the week number, and using the observed vector each week as the response. The prediction is then carried out each week in the posterior time period, before the local trend variance samples are added. This technique can be seen in Figure 6.7c.

Polynomial regression: This works similarly to linear regression, except it uses multiple powers of the predictor variable (week) for training and prediction. The method is implemented to use as many powers as possible: i.e. 2 powers for a prior time period of 3. This technique can be seen in Figure 6.7d.
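A minimal sketch of the moving-average and linear-regression adjustments (before the local trend variance samples are added): the moving-average variant here continues the mean weekly change over the last three prior weeks, which is one plausible reading of the description above, and the names are illustrative.

```java
import org.apache.commons.math3.stat.regression.SimpleRegression;

/** Sketch of two 'directed' counterfactual baselines over the prior time period.
 *  The local trend variance samples would be added to these values afterwards. */
public class DirectedBaselines {

    /** Continues the mean weekly change observed over the last `window` prior weeks. */
    static double[] movingAverage(double[] prior, int posteriorLength, int window) {
        double meanDelta =
                (prior[prior.length - 1] - prior[prior.length - 1 - window]) / window;
        double[] prediction = new double[posteriorLength];
        double level = prior[prior.length - 1];
        for (int i = 0; i < posteriorLength; i++) {
            level += meanDelta;
            prediction[i] = level;
        }
        return prediction;
    }

    /** Fits value ~ week on the prior period and extrapolates into the posterior. */
    static double[] linearRegression(double[] prior, int posteriorLength) {
        SimpleRegression regression = new SimpleRegression();
        for (int week = 0; week < prior.length; week++) {
            regression.addData(week, prior[week]);
        }
        double[] prediction = new double[posteriorLength];
        for (int i = 0; i < posteriorLength; i++) {
            prediction[i] = regression.predict(prior.length + i);
        }
        return prediction;
    }

    public static void main(String[] args) {
        double[] prior = {4.0, 4.1, 4.2, 4.3, 4.4, 4.5};   // steadily increasing rating
        System.out.println(java.util.Arrays.toString(movingAverage(prior, 4, 3)));
        System.out.println(java.util.Arrays.toString(linearRegression(prior, 4)));
    }
}
```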

We computed the agreement between each of the different prediction methods, and present the results in Table 6.2.

Table 6.2: CIRA models agreement on sample of 442 targets using 100 controls
Ag: agreement  CK: Cohen's Kappa
Local: local-only 'standard' prediction  MA: moving-average prediction
LR: linear regression prediction  PR: polynomial regression prediction

                    Local         MA            LR            PR
                    Ag     CK     Ag     CK     Ag     CK     Ag     CK
Human               0.75   0.33   0.72   0.31   0.74   0.33   0.63   0.22
CausalImpact        0.85   0.47   0.72   0.22   0.79   0.32   0.62   0.15
CIRA (SpikeSlab)    0.95   0.77   0.82   0.41   0.88   0.54   0.66   0.21

Each of the methods performs similarly, with the exception of polynomial regression, which appears to over-train on the prior time period (shown in Figure 6.7d).

CIRA's SpikeSlab method achieves higher overall agreement with the human assessment, and with CausalImpact, than the other methods, and therefore it remains the default model. Each of the models is available through CIRA's command line execution, and furthermore we have implemented an "ensemble classifier" that runs each model in turn, and presents the user with results from each model, as well as the overall confidence in its assessment. All of these methods evaluate the prediction in the same way as CIRA, by computing the inverse cumulative probability of the difference, after normalisation.

6.4 Extension 2 – Feature Regression

It would be useful to tie features directly to impacts, and to use them to make counterfactual predictions with greater accuracy. This information could then be used to predict which features could be added in order to produce an impactful release for a specific app. To facilitate this, we have implemented a feature regression component in CIRA, that works in a similar way to the spike partition of the spike-slab prior method [37] described in Section 6.1.1.

This extended method uses repeat runs of regression in order to select the features that most strongly predict the observed vector in the prior time period, and assigns coefficients to them which add to the overall SpikeSlab prediction. The model takes as input the document vector for each release, and the feature vector for each document. This model assumes that the feature vectors are fluctuating vectors that can be used to help predict the response vector. Unfortunately, these assumptions do not hold true for any of the collected apps: descriptive text is changed too rarely for this method to work properly. Hence, we have conducted an alternative study into features impacting releases in Section 7.5. However, this module remains available in the CIRA codebase as it may prove useful to integrate additional information into the model.

6.5 Threats to Validity

Internal validity: Our internal validity may be affected by the comparison of CIRA to CausalImpact, when CIRA is in fact based upon CausalImpact. We mitigate this threat by comparing the classification results of both tools with a human assessment of releases.

External validity: Our external validity may be affected by the dataset we used, and the scale of the metric changes inherent in it. This threat is mitigated by the adaptive nature of causal impact analysis, in that the scale (and thus significance) of a change is relative to its standard deviation prior to the intervention.

Construct validity: Our construct validity may be affected by our agreement tests, where we tested the classification results of each tool and algorithm only, without testing the effect size of the significant change detected. However, in our studies we apply causal impact analysis as a classifier only, and so this seems an appropriate agreement test in our case. Future work may compare the effect size of significant changes detected.

Conclusion validity: Our conclusion validity may be affected by our understanding of causal impact analysis. We mitigate this threat by comparing classification results between the tools directly, and by comparing the results with a human assessment.

6.6 Conclusions

CIRA is a Java-based implementation of causal impact analysis [37] that enables the use of large scale control sets, such as those found in app stores when using non-releasing apps as the control set. CIRA differs from CausalImpact in its approximation of the local delta variance parameter $v_d$, and its choice and parameters of the control set spike partition. Results have shown strong agreement between CausalImpact and CIRA, and a human evaluation of a sample of releases indicates that both methods perform similarly. A number of modules are built into CIRA, in order to facilitate future studies: a desktop tool, a web servlet tool, an ensemble classifier using alternative prediction techniques, and a feature regression predictor.

Chapter 7

What are the Properties of Impactful Releases?

Causal Impact Analysis [37] (discussed in more detail in Chapter 6) is a method for identifying significant events which may have impacted time-series vectors. In this chapter we apply causal impact analysis to identify significant 'impactful' app releases. This is done in the first instance using the tool written by Brodersen et al., CausalImpact [36], and subsequently using our own tool, CIRA (described in more detail in Section 6.2). We follow up on the causal impact analysis with more traditional and familiar (frequentist) inferential statistical analysis to further investigate the probabilistic evidence for potential causes, characteristics and effects.

7.1 Introduction

Causal inference [175] is primarily used in economic forecasting, for measuring or predicting the effect of an event on time-series data, but it has also seen recent use for software defect prediction [60, 61, 306]. We apply causal impact analysis to app releases in order to identify "impactful releases" that significantly impacted subsequent user rating behaviour or download rank. The subsequent inferential statistical analysis in Sections 7.3 to 7.5 complements this analysis, pointing to the set of potential properties which may play a role in this causal significance. Of course, the degree to which one can assume causality relies on the strength of one's causal assumptions: that the control set is unaffected by the release, and that the relationship of the controls to the released app is unchanged.

We compare distributions using a two-tailed unpaired non-parametric Wilcoxon test [295], which tests against the null hypothesis that the result sets are sampled from the same distribution. We also compare the result sets using Vargha and Delaney's $\hat{A}_{12}$ effect size comparison test [273], which results in a value between 0 and 1 that tells us the likelihood that one measure will yield a greater value than the other.
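For illustration, a minimal sketch of how these two statistics might be computed: the unpaired Wilcoxon comparison is obtained here through the equivalent Mann-Whitney U test provided by apache.commons.math3, and $\hat{A}_{12}$ is computed directly from its pairwise definition. The data values are hypothetical.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

/** Sketch: comparing two metric distributions with a rank-sum test and A12. */
public class DistributionComparison {

    /** Vargha and Delaney's A12: probability that a value drawn from `x`
     *  exceeds a value drawn from `y`, counting ties as one half. */
    static double varghaDelaneyA12(double[] x, double[] y) {
        double wins = 0.0;
        for (double xi : x) {
            for (double yj : y) {
                if (xi > yj) {
                    wins += 1.0;
                } else if (xi == yj) {
                    wins += 0.5;
                }
            }
        }
        return wins / ((double) x.length * y.length);
    }

    public static void main(String[] args) {
        double[] impactfulPrices    = {0.00, 0.00, 0.99, 1.49, 2.99, 8.90};   // hypothetical
        double[] nonImpactfulPrices = {0.00, 0.00, 0.00, 0.79, 0.99, 1.49};   // hypothetical
        // p-value under the null hypothesis that both samples share a distribution.
        double p = new MannWhitneyUTest().mannWhitneyUTest(impactfulPrices, nonImpactfulPrices);
        double a12 = varghaDelaneyA12(impactfulPrices, nonImpactfulPrices);
        System.out.printf("p = %.3f, A12 = %.3f%n", p, a12);
    }
}
```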

In this case, we apply these inferential statistical tests to examine differences in properties between the sets of releases that have demonstrated a causal impact (using causal impact analysis) and those which have not, as well as between the sets of releases that positively impacted rating and those that negatively impacted rating. To analyse the text of impactful releases, we perform information retrieval analysis on app descriptions and "what's new" text, using Topic Modelling and, in the case of Sections 7.3 and 7.4, also TF.IDF (both techniques are described in Chapter 3). We use these two different techniques to increase the confidence with which we can identify the top terms that occur in impactful 'release text'. Text is processed according to the steps detailed in Section 3.5. In any causal impact analysis, it is of course impossible to identify all external properties (such as advertising campaigns) that might have played a role in the changes observed. In the case of app stores, our current analysis cannot, for example, identify advertising campaigns timed to coincide with the new release. However, if such an external factor does have a significant effect, then causal impact analysis may detect it. We ask developers in Section 7.4 if they are aware of external factors which may have caused the observed changes, in order to determine how often this may be the case.

7.2 Developer Controlled Metrics

The developer-controlled metrics we use in experiments in this chapter are given in Table 7.1. We also use (P)rice, defined in Table 3.3. For Google, we measure release text as the combination of description and 'what's new' text (description for Windows). Here we do not insist that release text must have changed from a previous week, since this is reflected by $RT_{\mathit{change}} = 0$.

Table 7.1: Developer controlled metrics

Metric      Description
Day         Day of release (e.g. Monday, Tuesday)
RTsize      The size of RT (release text) in words
RTchange    The change in RTsize on the week of release

7.3 What are the Characteristics of Impactful App Releases?

This work was published in the ICSE 2016 Companion Proceedings [182]. The first author's contribution to this paper was to formulate the idea, implement and execute the experimentation, and collect and analyse the results; other authors of the paper contributed to research question formulation, result analysis and narrative write up.

App developers would like to know the characteristics of app releases that achieve high impact. To address this, we mined the most consistently popular Google Play and Windows Phone apps, once per week, over a period of 12 months. In total we collected 3,187 releases, from which we identified 1,547 for which there was adequate prior and posterior time series data to facilitate causal impact assessment, analysing the properties that distinguish impactful and non-impactful releases. We find that 40% of target releases impacted performance in the Google store and 55% of target releases impacted performance in the Windows store. We find evidence that higher prices, day of release and fewer mentions of bug fixing can increase the chance for a release to be impactful, and evidence that higher prices, more descriptive release text and new features rather than bug fixes can increase the chance for a release to improve rating.

7.3.1 Findings

Looking ahead to the results of this section, our findings are as follows:

i) Higher priced releases have a greater chance of impacting performance in the Google store, but the opposite is true in the Windows store: impactful release prices have a mean of £0.99 compared with £0.65 for non-impactful release prices in the Google store, and £0.47 compared with £0.54 in the Windows store.

ii) Of all impactful releases, higher priced releases have a greater chance of positively impacting rating: releases that positively impacted rating have a mean price of £1.03 compared with £0.78 for those that negatively impacted rating in the Google store, and £0.57 compared with £0.38 in the Windows store.

iii) There is a higher probability of impact for releases between Saturday and Tuesday, with up to 46% proving impactful on Tuesday (and as low as 36% on Wednesday) in the Google store, and up to 63% on Saturday (with as low as 48% on Thursday) in the Windows store.

iv) Impactful releases had proportionally fewer mentions of (bug, fix) in their release text than non-impactful releases: 33% to 38% in the Google store and 44% to 48% in the Windows store.

v) Releases that positively impacted rating had proportionally fewer mentions of (bug, fix) than those that negatively impacted rating: 29% to 33% in the Google store and 38% to 43% in the Windows store, and more mentions of (new, feature): 30% to 25% in the Google store and 55% to 39% in the Windows store.

vi) Of all impactful releases, those with longer release text have a greater chance of positively impacting rating: those that positively impacted rating have a median filtered word count of 25 (compared with a median of 19 for those that negatively impacted rating) in the Google store, and 135 compared with 119 in the Windows store.

vii) Our results indicate that Causal Impact Analysis is useful for identifying impactful releases for further analysis.

7.3.2 Data

We mined app data from Google Play and Windows Phone Store between July 2014 and July 2015, as detailed in Section 3.2.

The download rank is unavailable for apps outside the ‘top’ free or paid lists for both Windows Phone and Google Play app stores. For Google Play, this information is available for only the most popular 540 free and 540 paid apps, while for Windows Phone, download information is available for only the most popular 1000 free and 1000 paid apps.

We therefore mined apps from the Google Play and Windows Phone stores (on a weekly basis), recording the most popular 540 free and 540 paid apps from Google, and the top 1000 free and 1000 paid apps from Windows. The time period that we considered was from July 2014 to July 2015, giving 52 snapshots for each of Windows Phone and Google Play. In order to focus solely on the consistently popular apps, we computed the intersection of the 52 snapshots. This resulted in 307 consistently popular (i.e. always within the top 540) Google apps, and 726 consistently popular (i.e. always within the top 1000) Windows apps. These 1,033 apps are those studied in this section.

Our conclusions thus concern the effect of releases on the most consistently popular apps over a period of one year (July 2014 to July 2015), and care will be required before the results could be extended to apps and their releases more generally, due to the App Sampling Problem [183]. Nevertheless, we believe that conclusions about the characteristics of impactful releases for the most consistently popular apps will yield interesting and actionable findings for developers, because consistently popular apps are an inherently attractive and interesting subset of app stores.

We extract app metrics from each of the 1,033 apps including price, rating, download rank, number of ratings, description, what’s new and version. In the case of Google Play, the app store reports rounded app ratings (rounded to 1 decimal place), thereby creating a potential source of imprecision, which we would like to overcome. To improve comparability between the two app stores, we would also like to ensure that ratings are computed to the same precision. Therefore, we recalculate the (more precise) Google Play average ratings, using the extracted numbers of ratings in each of the five star ratings (from 1-5).
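A minimal sketch of this recalculation from the per-star counts; the array layout and names are hypothetical.

```java
/** Sketch: recompute an app's average rating from its 1-5 star rating counts. */
public class AverageRating {

    /** starCounts[0] holds the number of 1-star ratings, ..., starCounts[4] the 5-star count. */
    static double recompute(long[] starCounts) {
        long total = 0;
        long weightedSum = 0;
        for (int star = 1; star <= 5; star++) {
            total += starCounts[star - 1];
            weightedSum += (long) star * starCounts[star - 1];
        }
        return total == 0 ? 0.0 : (double) weightedSum / total;
    }

    public static void main(String[] args) {
        long[] counts = {120, 80, 240, 900, 2660};   // 1-star .. 5-star
        System.out.printf("average rating = %.4f%n", recompute(counts));   // 4.4750
    }
}
```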

We use the following data to answer RQ3 and RQ4:

Control set: apps that have no releases in the 12 month time period: for Google this set is 97 apps, and for Windows this is 397 apps. We compare different control set sizes in RQ3.2 in order to establish whether the choice of control set affects the results.

Target releases: releases that occurred at least 3 weeks after the previous release of the same app (or the start of the time series), and at least 3 weeks before the next release (or the end of the time series). This ensures sufficient data availability to accurately train causal impact analysis.

Impactful release: a release for which one of the performance metrics (R, D, N and NW) significantly deviated from the counterfactual prediction. We consider a significant deviation to be one in which the probability of predicting the observed data is ≤ 0.01. This is a cautious choice of probability (corresponding to the 99% confidence interval) to reduce the likelihood of raising false alarms, which would subsequently turn out not to be truly impactful after all. By setting the threshold to be 0.01, there is a 0.01 probability of claiming an impact where one does not exist, and we therefore expect a false positive rate of roughly 1%.

7.3.3 Research Questions

This subsection explains the questions posed in our study, and how we approach answering them.

RQ1: Do app metrics change over time?

Before performing detailed analysis of the changes over time, we first set a baseline by establishing whether app performance metrics change between snapshots. Using the metrics R, D, N and NW as defined in Section 3.4, we compute their standard deviations over 52 weeks for each app, and draw them on box plots. This enables us to establish whether metrics change over time, and to what extent this occurs. If the metrics do change over time, it motivates further analysis of events that could cause changes over time, specifically releases.

RQ2: Do release statistics have a correlation with app performance?

We measure whether app performance is affected by the number of app releases, by measuring the correlation between performance metrics and the number of releases in 52 weeks, as well as the change in metrics over 52 weeks.

RQ2.1: Does the number of releases have a high correlation with app performance? We perform correlation analysis between the number of releases of each app and their current value for the metrics R, D and N. We do not use NW because this number is set on a per-week basis, and instead use the change in R, D and N from the first snapshot to the last, denoted ∆R, ∆D and ∆N respectively.

RQ2.2: Does the median time interval between releases have a correlation with app performance? We perform correlation analysis between the median interval between multiple releases of each app and the metrics used in RQ2.1. For this, only apps that have releases in the time period are used.
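To make the correlation computation concrete, a small sketch using Spearman's rank correlation from apache.commons.math3 (an assumption here, since the thesis reports rho values); the per-app arrays are hypothetical.

```java
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

/** Sketch: rank correlation between per-app release counts and a metric change. */
public class ReleaseCorrelation {

    public static void main(String[] args) {
        // Hypothetical per-app values: releases in 52 weeks and change in number of ratings.
        double[] releaseCounts = {1, 2, 2, 3, 4, 5, 7, 9};
        double[] deltaN        = {120, 40, 310, 95, 80, 400, 150, 260};

        double rho = new SpearmansCorrelation().correlation(releaseCounts, deltaN);
        System.out.printf("rho = %.2f%n", rho);
        // Significance filtering (reporting only p <= 0.05) would be applied separately.
    }
}
```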

RQ3: Do releases impact app performance?

There are two limitations to correlation analysis. Firstly, correlation analysis seeks an overall statistical effect in which a large number of apps and their releases participate. However, it may also be interesting, from the developers' point of view, to identify specific releases that have an atypically high impact, compared to the 'background' behaviour and characteristics of the app store as a whole; general correlation analysis is not well-suited to this more specific question. Secondly, of course, as is well known, any correlation observed does not necessarily imply the presence of a cause (correlation is not causation). Therefore, even if we were to find strong correlations, this would not, of itself, help to identify causes. This motivates the use of causal impact analysis. We apply the causal impact analysis on each target release to see if it caused a significant change in any of the app-level performance metrics R, D, N or NW defined in Section 3.4. Causal impact analysis is described in Section 6.2.

RQ3.1: What proportion of releases impact app performance? We compute the proportion of apps whose releases have affected performance, and the proportion of overall releases.

RQ3.2: How does the causal control set size affect results? Causal impact analysis requires a control set (in this case, a set of apps that have zero releases in the period studied). As the set of potential members of the control set is different in size between Google and Windows, we carry out experiments to assess how much control set differences could influence the results. Our approach is similar to the experiment using different control sets in the study by Brodersen et al. [37]. We compute the causal impact analysis results for each metric for a sample of 100 target releases in the Windows dataset, using control sets of size 100, 200 and 397, the smaller of which are randomly sampled from the maximum possible set of 397 non-releasing apps.

RQ4: What characterises impactful releases?

We use the causal impact analysis results from RQ3 to analyse impactful and non-impactful releases.

RQ4.1: What are the most prevalent terms in releases? We pre-process the release text from all releases in a given store as described in Section 3.5, then identify the 'top terms' (most prevalent) for each set of releases using two methods: TF.IDF [179] and Topic Modelling [32].

We train both methods on the release text corpus, treating each instance of release text as a document. Each store is treated separately for training and evaluation, to prevent combining store-specific vocabularies. For both methods, we sum the resultant scores (probabilities) for terms (topics) over each set of releases.
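To illustrate the TF.IDF half of this step, the self-contained sketch below scores terms in a toy corpus and sums the scores over a chosen set of releases; the weighting shown (raw term frequency times log inverse document frequency) is one standard formulation and may differ in detail from the implementation described in Chapter 3.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Sketch: summing TF.IDF scores over a set of release texts to rank 'top terms'. */
public class TopTerms {

    static Map<String, Double> sumTfIdf(List<String[]> corpus, List<Integer> releaseSet) {
        // Document frequency of each term over the whole (per-store) corpus.
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (String[] document : corpus) {
            for (String term : new HashSet<>(Arrays.asList(document))) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        // Sum term TF.IDF scores over the selected set of releases.
        Map<String, Double> summedScores = new HashMap<>();
        for (int index : releaseSet) {
            Map<String, Integer> termFrequency = new HashMap<>();
            for (String term : corpus.get(index)) {
                termFrequency.merge(term, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> entry : termFrequency.entrySet()) {
                double idf = Math.log((double) corpus.size()
                        / documentFrequency.get(entry.getKey()));
                summedScores.merge(entry.getKey(), entry.getValue() * idf, Double::sum);
            }
        }
        return summedScores;
    }

    public static void main(String[] args) {
        List<String[]> corpus = Arrays.asList(
                new String[]{"bug", "fix", "crash"},
                new String[]{"new", "feature", "chat"},
                new String[]{"bug", "fix", "performance"});
        // Hypothetical 'impactful' set: releases 0 and 2.
        sumTfIdf(corpus, Arrays.asList(0, 2))
                .forEach((term, score) -> System.out.printf("%s %.3f%n", term, score));
    }
}
```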

The topic model is trained using 20 topics in each case. We chose 20 topics for two reasons: i) the number of documents in each corpus is between 1000 and 2000, each typically consisting of tens to hundreds of words, which is by no means a large corpus; ii) we wish to generalise, to avoid training topics that are relevant only to certain apps or releases. The choice of 20 topics allows for generalisation, and for manual inspection of each of the trained topics, without much risk of training a topic that is overly specific to an app or release.

RQ4.2: How often do top terms and topics occur in each set of releases? We count the releases in each set whose release text contains the top terms identified by TF.IDF and topic modelling in RQ4.1. We apply a bag-of-words model, which ignores the ordering of the words in each document. This eliminates the need to check for multiple forms of text that discuss the same thing.
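A small sketch of this bag-of-words counting for a tuple such as (bug, fix): a release is counted if its release text contains every word of the tuple, regardless of word order. Here the text is split on whitespace for brevity, whereas the thesis applies the pre-processing of Section 3.5 first.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Sketch: count releases whose release text contains every word of a tuple. */
public class TupleCounter {

    static long countContaining(List<String> releaseTexts, String... tuple) {
        return releaseTexts.stream()
                .map(text -> new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+"))))
                .filter(bagOfWords -> bagOfWords.containsAll(Arrays.asList(tuple)))
                .count();
    }

    public static void main(String[] args) {
        List<String> impactfulReleases = Arrays.asList(
                "bug fix for crash on startup",
                "new chat feature added",
                "performance improvements and bug fix");
        System.out.println(countContaining(impactfulReleases, "bug", "fix"));     // 2
        System.out.println(countContaining(impactfulReleases, "new", "feature")); // 1
    }
}
```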

RQ4.3: What is the relative probability of each of the candidate causes? We select the sets of impactful and non-impactful releases, as well as the sets of impactful releases that positively impacted rating and those that negatively impacted rating, and compare the distributions of several developer-controlled metrics. This helps to establish probable causes that may have led to the releases being impactful, or positively affecting one of the most important performance metrics for developers: the rating.

We use the developer metrics P, Day, RTsize and RTchange in this experiment, as described in Section 7.2. We perform a Wilcoxon test [295] and Vargha and Delaney's $\hat{A}_{12}$ effect size comparison [273] between these distributions, described in more detail in Section 3.1. We also compare the distributions using box plots, which show the median and interquartile range of the distributions, so that we can see whether one sits higher or lower than another. We compare the distributions of release days using histograms, which enable us to see whether any particular weekday or set of weekdays leads to a better or worse likelihood of impactful releases.

Figure 7.1: RQ1: Standard deviation box-and-whisker plots for R (rating), D (download rank), N (number of ratings) and NW (number of ratings per week) [182]. Goog: Google Play dataset; Win: Windows Phone Store dataset.

7.3.4 Application of CausalImpact

In this section, we apply the Causal Impact Analysis method [37] using the Google CausalImpact framework [36].

7.3.5 Results

This subsection answers the questions posed in Section 7.3.3.

RQ1: Do app metrics change over time? The box plots in Figure 7.1 show that the metrics (R)ating, (D)ownloads, (N)umber of reviews and (NW) number of reviews per week do, indeed, change over time, because their median standard deviation is always positive.

However, Figure 7.1 reveals that not all metrics vary so greatly: the (R)ating metric exhibits the least variation (median standard deviation < 0.05 for both Google and Windows). This is a potentially useful baseline finding, because it means that a high-impact release (one that does affect rating) has a chance to 'stand out against the crowd'.

Developers are likely to care about such releases, since ratings are important, and there is some evidence that they impact upon popularity, and thereby revenue [109].

Table 7.2: RQ2.1 and RQ2.2: Correlation results for releases [182]
R: rating  D: download rank  N: number of ratings
∆: change in metric from week 1 to week 52
-: correlation not significant (p > 0.05)

RQ2.1: Number of releases
Store      R       ∆R      D       ∆D      N       ∆N
Google     -       -       -       -       -       0.13
Windows    0.20    -       -0.17   -       0.32    0.42

RQ2.2: Release interval
Store      R       ∆R      D       ∆D      N       ∆N
Google     -       -       -       -       -0.15   -0.19
Windows    -       -       0.12    0.13    -       -

Answer to RQ1: Do app metrics change over time? The metrics (D)ownload rank, (N)umber of ratings and (NW) number of ratings per week show a high standard deviation for apps over the 52 week time period, but (R)ating shows only a small deviation.

RQ2: Do release statistics have a correlation with app performance? Having established a baseline, we now measure the correlations between the number and interval of releases in 52 weeks, and performance metrics.

RQ2.1: Does the number of releases have a high correlation with app performance? Table 7.2 presents the results of correlation analysis between release frequency and app metrics for the Google and Windows app stores. We only report correlation coefficients (the rho values) that are deemed significant (p ≤ 0.05), i.e., where there is sufficient evidence that rho ≠ 0. The results from our Google dataset in Table 7.2 are sparse, indicating only one significant correlation between the number of releases and ∆N, the change in the number of reviews. Even this correlation, although significant, is still weak (rho = 0.13), so we conclude that there is little evidence for correlation between release frequency and app metrics in the Google app store. A greater number of significant correlations was observed for the Windows app store (Table 7.2). However, the corresponding correlation coefficients (rho values) remain low, providing evidence only for a mild correlation between release frequency and the change in the number of reviews per week (∆N). We therefore conclude that there is no strong overall correlation between release frequency and the app metrics we collect for either app store, but there is evidence for a mild correlation between release frequency and number of reviews in 52 weeks in the Windows store.

RQ2.2: Does the median time interval between releases have a correlation with app performance? Table 7.2 presents the results of correlation analysis between release interval and app metrics for the Google and Windows app stores. As these results reveal, there is little evidence for any strong correlation between the median inter-release time period and the app metrics we collect. These findings corroborate and extend the recent findings by McIlroy et al. [191], who reported that the rating was unaffected by release frequency in the Google app store. This is interesting because there is evidence that app developers release more frequently when an app is performing poorly [55]; our results indicate that this, perhaps rather desperate, behaviour is unproductive.

Answer to RQ2: Do release statistics have a correlation with app performance? Neither higher numbers of releases nor shorter release intervals correlate with changes in performance; developers who increase release frequency with the aim of improving performance may be wasting their effort.

RQ3: Do releases impact app performance? The results from RQ1 and RQ2 have established that app performance metrics do vary over releases, but that the number of releases and time intervals between releases are not important factors in determining these performance changes. This makes causal impact analysis potentially attractive to developers. With it, a developer can seek to identify the set of specific releases that had a higher performance impact, using evidence for significant changes in post-release performance compared with the set of non-releasing apps. This is the analysis to which we now turn in RQ3.

RQ3.1: What proportion of releases impact app performance?

Table 7.3: RQ3.1: Causal impact analysis results [182]
R: rating  D: download rank  N: number of ratings  NW: number of ratings per week

Google

Details of target releases (percentages reported over 754 target releases)
Apps               307        Non-impactful    453 (60.1%)
Total releases     1,570      Impactful        301 (39.9%)
Target releases    754

Details of impactful releases (percentages reported over 301 impactful releases)
Metric          Total (of impactful)   +ve            -ve
R               152 (50.74%)           67 (22.3%)     85 (28.2%)
D               130 (43.2%)            48 (15.9%)     82 (27.2%)
N               54 (17.9%)             54 (17.9%)     0
NW              84 (29.9%)             52 (17.3%)     32 (10.6%)
R ∩ D           32 (10.6%)             9 (3.0%)       12 (4.0%)
R ∩ D ∩ NW      7 (2.3%)               2 (0.7%)       1 (0.3%)

Windows

Details of target releases (percentages reported over 793 target releases)
Apps               726        Non-impactful    356 (44.9%)
Total releases     1,617      Impactful        437 (55.1%)
Target releases    793

Details of impactful releases (percentages reported over 437 impactful releases)
Metric          Total (of impactful)   +ve            -ve
R               228 (52.2%)            90 (20.6%)     138 (31.6%)
D               207 (47.4%)            93 (21.3%)     114 (26.1%)
N               267 (61.1%)            267 (61.1%)    0
NW              48 (11.0%)             27 (6.2%)      21 (4.8%)
R ∩ D           90 (20.6%)             15 (3.4%)      36 (8.2%)
R ∩ D ∩ NW      11 (2.5%)              2 (0.74%)      5 (1.1%)

Table 7.4: RQ3.2: Impactful releases using different control sets [182]
R: rating  D: download rank  N: number of ratings  NW: number of ratings per week

                  Control Set
Result            100     200     397
R results         45      43      39
D results         40      42      41
N results         36      40      40
NW results        13      13      14

Table 7.3 presents overall summary statistics for the results of causal impact analysis. The row labelled 'Apps' indicates the number of apps summarised in each of the two sub-tables (307 apps for Google and 726 apps for Windows). This is the total number of apps which remain consistently popular over all 52 weeks studied. The total number of releases reports the number of app releases over the 52 weeks studied, while the 'target releases' denotes the subset of releases for which there is sufficient prior and posterior information available to support causal impact analysis, in terms of counterfactual posterior predictions based on prior observations. Those releases that occur near the beginning or end of the time period will therefore not have sufficient information available, and so their causal impact cannot be studied; hence we select a subset of releases that must also belong in the range of weeks [4, 49], out of a possible [1, 52]. Of these target releases, some are impactful and some are not according to causal impact analysis. As Table 7.3 reveals, 39.9% of the target releases in the Google store and 55.1% in the Windows store are impactful.

The remainder of Table 7.3 reports the observed change in performance metrics for impactful releases, thereby identifying candidate causes of these impacts. For each performance metric change, we report the total number of releases that exhibited an impactful change in the associated metric and the percentage (of all app releases) that exhibited the change. We further subdivide this total into those which are considered positive and those which are considered negative from the developers' perspective.

Of the 39.9% of Google target releases that were impactful, approximately a third (31.9%) impacted more than one performance metric. The releases of most potential interest to developers are those that impact rating and download rank (since these are most closely coupled to revenue), of which there were 32 impactful releases. In the Windows dataset there was a higher proportion of impactful releases: 55.1% were impactful in some performance metric, and of these impactful releases, approximately half (49.7%) had an impact on multiple metrics. There were 90 releases in the Windows dataset that impacted rating and download rank, and 11 that impacted rating, download rank and number of ratings per week. These results support the hypothesis that there is a subset of releases that cause significant changes to their app's performance in the store.

RQ3.2: How does the causal control set size affect results? Table 7.4 reports the effect of choosing different control set sizes, from among those apps which did not undergo any releases during the time period studied. The Table 7.4 results show that very similar findings are observed for impactful releases (with respect to each performance metric), irrespective of the choice of control set.

Answer to RQ3: Do releases impact app performance? There is strong evidence (p ≤ 0.01) that approximately 40% of the target releases in the Google Play store significantly affected a performance metric, and approximately 55% of target releases in the Windows store significantly affected a performance metric, perhaps indicating greater maturity in the Google Play store (and thus greater difficulty in having an “impact”).

RQ4: What characterises impactful releases? The finding from RQ3 indicates that there are impactful releases in both app stores, but it cannot identify the causes, merely that there has been an impact in post-release performance. We now turn to analyse candidate causes, and investigate the relative probability that each has played a significant role in causing the impacts observed.

Table 7.5: RQ4.1: Top release text terms for each set of releases in the Google and Windows datasets; TF.IDF terms are on the left and Topic Modelling topics are on the right [182]. R: rating; D: download rank; N: number of ratings; NW: number of ratings per week.

Table 7.6: RQ4.2: Occurrences of the terms (bug, fix) and (new, feature) in each set of releases, for the Google and Windows datasets [182]. R: rating; D: download rank; N: number of ratings; NW: number of ratings per week.

RQ4.1: What are the most prevalent terms in releases? Table 7.5 reports the results of information retrieval, using TF.IDF and topic modelling, on the release text of impactful releases. In this table, we consider only those apps for which release text is available. For Google, of the 754 target releases (those with sufficient evidence for causal impact analysis), 641 have release text available. For Windows, 546 of the 793 target releases have available release text. The remainder of the table consists of four overall columns, giving the metric for which a release is found to be impactful (leftmost column), followed by the most prevalent terms (for TF.IDF) and topics (for topic modelling), followed by a subdivision of these prevalent terms into those whose impacts are positive and negative from the developers' perspective. We restrict topics to only the top three terms in the table, in order to give an overview of each topic in a concise manner.

Table 7.5 reveals that terms and topics themed around bug fixes occur frequently in the Google dataset, while in the Windows dataset, the topics appear to be more closely associated with features (message chat free, new search feature). The Windows app store is comparatively more recent than the Google app store, and it is tempting to speculate that, at this comparatively immature stage, perhaps users are more concerned with new features than bug fixes. Further research would be required to investigate this possibility. Nevertheless, these observations motivate our analysis in RQ4.2 for the tuples (bug, fix) and (new, feature).

RQ4.2: How often do top terms and topics occur in each set of releases? Table 7.6 shows the number of occurrences, within impactful and non-impactful releases, of the tuples (bug, fix) and (new, feature) highlighted by information retrieval in RQ4.1. The results in Table 7.6 show that the terms (bug, fix) are more common in the non-impactful releases in both stores. However, it is more striking that for the metrics (R)ating and (D)ownload rank, the terms occur more often in the releases that negatively impact rating, and negatively impact popularity (having a positive impact in the (D)ownloads column). We conclude, therefore, that release text mentioning bug fixes occurs more frequently in releases that negatively impact metrics such as rating and download rank.

The results show that there are proportionally more mentions of (new, feature) in releases that positively impact (R)ating and popularity for both Google and Windows.

Table 7.7: RQ4.3: Probabilistic analysis of candidate contributions [182]
P: price  Day: day of release (e.g. Monday, Tuesday)
RTsize: size of release text in words  RTchange: change in RTsize on the week of release
+ve R: releases that increased rating  -ve R: releases that decreased rating

Google
            Impactful / Non-impactful     +ve R / -ve R
Metric      Wilcoxon     Â12              Wilcoxon     Â12
P           0.006        0.546            0.045        0.570
Day         0.132        0.476            0.444        0.493
RTsize      0.295        0.488            0.054        0.576
RTchange    0.224        0.484            0.155        0.548

Windows
            Impactful / Non-impactful     +ve R / -ve R
Metric      Wilcoxon     Â12              Wilcoxon     Â12
P           0.121        0.480            0.073        0.546
Day         0.158        0.520            0.439        0.506
RTsize      0.359        0.493            0.279        0.523
RTchange    0.241        0.486            0.249        0.526
There were (proportionally) more impactful releases in the Google store that mentioned (new, feature), but fewer for Windows.

These results lead to the conclusions that releases are more likely to positively impact rating or popularity if they claim to introduce new features, and more likely to negatively impact rating or popularity if they claim to fix bugs. While the former finding is to be expected, it seems a little unfair on developers that bug fix claims might reduce performance. Future work might further investigate this effect to see whether bug fix claims are unsubstantiated, thereby providing a potential explanation.

RQ4.3: What is the relative probability of each of the candidate causes? To better understand the differences between candidate causes of impacts, we use non-parametric inferential statistics to investigate the differences between the metrics we collect for releases, depending on whether they are identified as impactful or not, and to assess the relative probability that each of these candidate causes plays a role in the impacts observed. We measure aspects of the release that lie within the control of the developer: price, day of release, the size of release text and the change in size of release text on release week.

The results in Table 7.7 show that the most probable candidate causes in each case are price and the size of the release text. Figure 7.2 presents box plots showing the distribution of price and release text size, comparing impactful releases against non-impactful releases, and releases that positively impact rating against those that negatively impact rating.

Figure 7.2: RQ4.3: Box plots of Price and Release Text for the Google and Windows datasets [182]. I: impactful releases; NI: non-impactful releases; +ve R: releases that increased rating; -ve R: releases that decreased rating.

Since half of the apps are free, the median prices are 0 and so the box plots only show the upper quartile; we therefore computed the mean prices of impactful releases and non-impactful releases: impactful releases had a mean price of £0.99 and non-impactful releases had a mean of £0.65 in the Google store; £0.47 compared with £0.54 in the Windows store. The results show that higher priced releases are more likely to be impactful in the Google store but, conversely, lower priced Windows releases are more likely to be impactful, highlighting interesting differences between the stores. Unsurprisingly, these results confirm the intuition that users can be expected to be price-sensitive (in both stores). What is surprising, and potentially interesting for developers, is that the releases of higher-priced apps are more likely to have positive impacts on rating. We computed the mean prices for impactful releases that positively or negatively affected the rating: £1.03 and £0.78, respectively, for Google, and £0.57 and £0.38, respectively, for Windows.

Of course, the price difference of £0.25 in Google Play appears relatively trivial. However, the difference in revenue that accrues can be substantial. We conservatively calculate that for the app OfficeSuite Pro, over its (minimum) 1 million installs [94], the price difference of £0.25 extrapolates to (minimum) £250,000. Our results revealed that this app had a Google Play release on 9th Sept 2014, for which we observed a significant impact on the subsequent rating. This calculation is particularly conservative, since at the time of this release, OfficeSuite Pro was priced at £8.90. Therefore, this suggests that developers need not fear a price 'race to the bottom' with competitors.

Figure 7.3: RQ4.3: Histograms showing the days of release for the Google and Windows datasets [182]. Clear: impactful releases; shaded: non-impactful releases.

The results in Table 7.7 and Figure 7.2 also show that releases with larger amounts of release text are more likely to positively impact the rating of an app. The median number of words in positively impactful release text is 25, compared to only 19 for negatively impactful release text in Google Play, and 135 compared to 119 in Windows Phone. This provides evidence that users do read and respond to release text. The finding also suggests that developers might wish to spend time carefully choosing their release text to maximise positive influences on the users.

The reader may be surprised to learn that app developers need to choose very carefully the day of the week on which they release their app [113]. This is one of the (perhaps) surprising observations that underscores the peculiarities of software engineering for app stores compared to traditional software development release practices. The potentially high release frequency, and the immediacy of the app store ecosystem, migrate software engineers into a new world with quite different concerns from those with which they may be familiar.

Figure 7.3 presents histograms showing the frequency distribution of impactful and non-impactful releases over the day of release, for Google and Windows. These results extend (to Google Play and Windows Phone) the previous finding of Henze and Boll [113], that release activity is lowest on a Sunday in the Apple App Store. However, we find that, while Windows developers may benefit the most from weekend releases, Google Play developers are more likely to have impact early in the week on Monday and Tuesday. Therefore, should a developer seek to coordinate releases in multiple app stores, the evidence suggests that it would be advisable to release between Saturday and Tuesday.

Answer to RQ4: What characterises impactful releases? There is evidence that releases with higher prices and more descriptive release text could be more likely to positively impact rating if they are impactful, and evidence that releases from Saturday to Tuesday are more likely to be impactful. There is also evidence that, for both stores, the release text of releases that positively impact performance makes fewer mentions of bug fixes and focuses more on new features.

7.3.6 Threats to Validity

Internal validity: Our dataset is subject to the App Sampling Problem [183]. Performance data is available only for the most popular apps in each of the app stores studied. We restrict claims about findings to those that apply specifically to the most consistently popular apps over the 52 week period studied, and thereby do not suffer from the App Sampling Problem in our findings. However, any attempt to extend and generalise the findings to other apps would be vulnerable to the App Sampling Problem, and so great care is required due to this potential threat to validity.

External validity: Naturally, care is required when extending these findings to other app stores. Indeed, this study on the Google and Windows stores shows differences between the two, as higher priced releases in Google are more likely to be impactful, whilst the same is true of lower priced releases in Windows. There are many common findings, but this difference highlights the fact that our results may not apply to other app stores. Nevertheless, the methods we used to analyse causal impacts can be applied to other app stores.

Construct validity: Our construct validity may be affected by our selection of price, day of week and release content as potential causes for impactful releases. We test for significant differences between impactful and non-impactful releases for these metrics, but we may be missing other potential causes. It may be that impactful releases have other properties that we have missed, which we leave as future work to explore. Additionally, we ask developers if they know of external causes in Section 7.4. It may also be that developers update an app without updating its documentation, and thus any feature changes are not detected in the release text of the app. This threat is mitigated by comparing only apps with changed text.

Conclusion validity: Our conclusion validity could be affected by the qualitative human assessment of ‘top terms’ and topics for sets of releases in RQ4.1. We mitigate this threat to validity by asking a quantitative question of the number of times (bug, fix) and (new, feature) occur in each set of releases in RQ4.2.

7.3.7 Conclusions

Our analysis of the Google Play and Windows Phone app stores, over a period of 52 weeks from July 2014 to July 2015, has found that overall release frequency is not correlated with subsequent app performance, but that there is evidence that price, release text size and content, and day of release all play a role in whether a release is impactful and the type of impact it has. Higher priced releases are more likely to be impactful and, perhaps surprisingly, to positively impact rating; releases launched between Saturday and Tuesday (and therefore not mid-week) are more likely to be impactful; releases with text mentioning new features instead of bug fixes are more likely to be impactful and to positively impact rating; and releases with longer (more descriptive) text are also more likely to positively impact rating.

7.4 Causal Impact Analysis for App Releases in Google Play

This work was published in the FSE 2016 proceedings [185]. The first author’s contribution to this paper was to formulate the idea, implement and execute the experimentation, and collect and analyse the results; the other authors of the paper contributed to research question formulation, result analysis and the narrative write up.

App developers would like to understand the impact of their own and their competitors’ software releases. In Section 7.3, we found that causal impact analysis is useful for identifying impactful releases for further analysis, using a dataset of the most popular 1,033 apps mined from Google Play and Windows Phone Store. In order to investigate whether findings extend to the wider app store and less popular apps, we mined 38,858 popular Google Play apps, over a period of 12 months. The extension to this dataset raises computation time exponentially, due to the larger control set size. To address this, we use our Causal Impact Release Analysis tool, CIRA, that implements causal impact analysis for app stores. For these apps, we identified 26,339 releases for which there was adequate prior and posterior time series data to facilitate causal impact analysis. We found that 33% of these releases caused a statistically significant change in user ratings. We use the approach to reveal important characteristics that distinguish causal significance in Google Play. To explore the actionability of causal impact analysis, we elicited the opinions of app developers: 52 companies responded, 78% concurred with the causal assessment, of which 33% claimed that their company would consider changing its app release strategy as a result of our findings.

7.4.1 Findings

The findings from this study are as follows:

1. The causal impact analysis tool, CIRA (available at http://www0.cs.ucl.ac.uk/staff/w.martin/cira), is useful for performing causal impact analysis on app store data, where dataset size may raise computation time exponentially for other tools.

2. We contacted developers of impactful releases: 52 developers responded, 78% of whom agree with CIRA’s assessment, of which 33% claimed that their company would consider changing their release strategy.

3. We study Google Play app releases using CIRA, finding:
3.1. Paid app releases have a greater chance of affecting subsequent user ratings (a chance of 40% for paid apps, compared with 31% for free apps).
3.2. Paid apps with releases that had impactful positive effects have higher prices.
3.3. Free apps with impactful releases have a greater chance for their effects to be positive (37% for paid apps compared with 59% for free apps).
3.4. Releases that positively affected user ratings had more mentions of bug fixes and new features.
3.5. Releases that affected subsequent user ratings were more descriptive of changes.

7.4.2 Data

We mined app data from Google Play between February 2015 and February 2016, as detailed in Section 3.2.

Full set: All apps mined in the time period. Some apps drop out of the store (for unknown reasons), but they are included in the full set for the duration in which they appear in the store. This full set consists of 38,858 apps.

Control set: The set of apps that have no new releases over the studied time period. We refer to this as the control set, as it is the benchmark by which one can measure changes in the releasing apps. Apps that drop out of the store are not included in the control set because consistency is required for a reliable control set. This control set consists of 680 apps.

Target set: The set of app releases that occur in the studied time period, at least 3 weeks after the previous release and at least 3 weeks before the next release. The target releases have some longevity, which suggests that they are more than ‘hotfixes’ for recently introduced bugs. They also have a non-trivial window of data on either side, which makes it possible to observe any effect the release may have had on the app’s performance. This ensures sufficient data availability to perform causal impact analysis. Apps that drop out of the store are included in the target set if they include adequate prior and posterior information as defined above. This target set consists of 14,592 apps and 26,339 releases.
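Purely as an illustration of the selection rules above (not CIRA’s actual implementation), the following sketch assembles the control and target sets from a table of weekly release observations; the DataFrame layout and the column names `app_id` and `week` are assumptions.

# Hedged sketch: build the control set (apps with no releases) and the target set
# (releases at least `min_gap_weeks` away from their neighbours). Names are assumed.
import pandas as pd

def build_sets(releases: pd.DataFrame, all_apps: set, min_gap_weeks: int = 3):
    releasing_apps = set(releases["app_id"])
    control_apps = all_apps - releasing_apps   # apps with zero releases in the period

    target_rows = []
    for app_id, group in releases.sort_values("week").groupby("app_id"):
        weeks = group["week"].tolist()
        for i, week in enumerate(weeks):
            prev_ok = i == 0 or week - weeks[i - 1] >= min_gap_weeks
            next_ok = i == len(weeks) - 1 or weeks[i + 1] - week >= min_gap_weeks
            if prev_ok and next_ok:
                target_rows.append({"app_id": app_id, "week": week})
    # A further window filter (sufficient prior and posterior weeks of data) is
    # applied before causal impact analysis, as discussed later in this chapter.
    return control_apps, pd.DataFrame(target_rows)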

7.4.3 Research Questions

This section explains the questions posed in this study and how we approach answering them.

RQ1: Do app metrics change over time?

We establish a baseline by determining whether app success metrics change over time, computing the standard deviation, over 52 weeks, of the metrics R, N, NW and L as defined in Section 3.4. These distributions are drawn using box plots. We determine whether app success changes more or less for those apps that have releases, by comparing plots for those apps that have no releases (the control set) with releasing apps (the target set). If the metrics for the target set change over time more than those in the control set, this motivates further analysis of releases that could cause the observed changes. It is expected that releasing apps would show greater deviation in description length, describing changes and features.
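A minimal sketch of this baseline computation, assuming a long-format table `weekly` with one row per app per week and assumed column names (`app_id`, `R`, `N`, `NW`, `L`):

# Hedged sketch of the RQ1 baseline: per-app standard deviation of each metric over
# the 52 weekly snapshots, compared between the control and target sets via box plots.
import pandas as pd
import matplotlib.pyplot as plt

def metric_deviations(weekly: pd.DataFrame, metrics=("R", "N", "NW", "L")):
    return weekly.groupby("app_id")[list(metrics)].std()

def compare_sets(weekly, control_apps, target_apps, metric="R"):
    devs = metric_deviations(weekly)
    con = devs.loc[devs.index.intersection(list(control_apps)), metric].dropna()
    tar = devs.loc[devs.index.intersection(list(target_apps)), metric].dropna()
    plt.boxplot([con, tar])
    plt.xticks([1, 2], ["Con", "Tar"])
    plt.ylabel(f"std. dev. of {metric} over 52 weeks")
    plt.show()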

RQ2: Do release statistics have a correlation with app performance?

We build on this analysis by measuring whether app success is correlated with the number of app releases in the time period studied, and whether it is correlated with the time interval between releases. This will show whether a large (or conversely, small) number of releases might lead to increased success, and likewise for release interval.

For both of these experiments, only apps in the target dataset are used. We perform correlation analysis between the number of releases of each app and their value for the metrics R and N at the end of the time period. The metric NW is not used because this number is set on a per-week basis; instead we use the change in R and N from the first snapshot to the last, denoted ∆R and ∆N respectively. Additionally, we do not use L because it does not represent app success.

RQ2.1: Does the number of releases have a high correlation with app performance? We perform correlation analysis between the number of releases in the studied time period and the metrics R, ∆R, N and ∆N at the end of the time period.

RQ2.2: Does the median time interval between releases have a correlation with app performance? We perform correlation analysis between the median interval between releases of each app and the metrics used in RQ2.1.
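The thesis does not spell out which correlation coefficient is used; the sketch below uses Spearman’s rank correlation as a stand-in, over an assumed per-app summary table with columns `n_releases`, `median_interval`, `R`, `dR`, `N` and `dN`.

# Hedged sketch of the RQ2 correlation analysis: report only coefficients whose
# p-value meets the significance threshold, mirroring the reporting in Table 7.8.
from scipy.stats import spearmanr

def significant_correlations(df, predictors=("n_releases", "median_interval"),
                             metrics=("R", "dR", "N", "dN"), alpha=0.05):
    results = {}
    for pred in predictors:
        for metric in metrics:
            rho, p = spearmanr(df[pred], df[metric], nan_policy="omit")
            if p <= alpha:
                results[(pred, metric)] = round(rho, 2)
    return results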

RQ3: Do releases impact app performance?

Using CIRA to perform causal impact analysis, we identify the impactful releases as defined in Section 6.1.1, and experiment with different control set sizes to assess the effect this may have on results.

RQ3.1: What proportion of releases impact app performance? We compute the proportion of apps whose releases have affected subsequent success. We group results under the metrics affected: R, N, NW and the intersection of R and NW, overall and for positive or negative changes.

RQ3.2: How does the causal control set size affect results? Causal impact analysis uses a control set: a set of unaffected data vectors, which in this case is the set of apps that have zero releases in the period studied. As the set is used in two different ways (‘spike’ and ‘slab’ [249], as explained in Section 6.1.1), we test whether the set size makes a difference that could influence results. Our approach is similar to the experiment using different control sets in the study by Brodersen et al. [37]. We compute the causal impact analysis results for each metric with the full target dataset, using repeatedly halved control sets of size 340, 170, 85, 42, 21, 10, 5, and 3, which are each randomly sampled from the maximal set of 680 non-releasing apps. We compute the agreement between each set of results and the results used to answer RQ3.1. Agreement is defined as (YY + NN) / total, where YY indicates a significant change detected on both datasets, NN indicates no significant change detected on both datasets, and total is a count of all results (including disagreements). We also compute Cohen’s Kappa [54], which takes into account the chance for random agreement and is bounded above by the agreement. We compute a second set of results using 680 control vectors, which will show the expected difference between consecutive runs with the same control set, since there is a random component in the predictive model.
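The agreement and kappa computations can be made concrete with the small sketch below, where `a` and `b` are boolean vectors recording, for the same releases, whether a significant change was detected under two different control sets; scikit-learn’s `cohen_kappa_score` is used here only as a convenient stand-in for whatever implementation was actually used.

# Hedged sketch: agreement = (YY + NN) / total, plus chance-corrected Cohen's Kappa.
from sklearn.metrics import cohen_kappa_score

def agreement(a, b):
    matches = sum(1 for x, y in zip(a, b) if x == y)   # YY and NN cases
    return matches / len(a)                            # total includes disagreements

def kappa(a, b):
    return cohen_kappa_score(a, b)                     # bounded above by the agreement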

RQ4: What characterises impactful releases?

We use the causal impact analysis results from RQ3 to identify the differences between impactful and non-impactful releases.

RQ4.1: What are the most prevalent terms in releases? We identify the most prevalent terms for each set of releases using TF.IDF [179] and Topic Modelling [32]. Both methods are trained on the release text corpus, treating each instance of release text as a document. For both methods, we sum the resultant scores (probabilities) for terms (topics) over each set of releases.

The topic model is trained using 100 topics. We chose 100 topics for three reasons: i) the corpus (updated release text from target releases) contains a large number of documents, each consisting of tens to hundreds of words; ii) we do not wish to over-generalise, nor over-fit; 100 topics allow diversity without assigning the same topic to every document with a release; iii) 100 topics is a common selection for non-trivial datasets, serving as the default setting in GibbsLDA++ [224] and JGibbLDA [225]. Intuitively, the choice of 100 topics allows for diversity in the trained topics, without unduly elevating the risk of training a topic that is overly specific to an app or release.
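The thesis performs this step with GibbsLDA++/JGibbLDA; the following scikit-learn sketch is only an illustrative stand-in for extracting the top TF.IDF terms and the terms of the most prevalent topic from a list of release texts.

# Hedged sketch of RQ4.1: sum TF.IDF scores per term, and topic probabilities per
# topic, over a set of release texts, then report the top terms.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_tfidf_terms(texts, k=10):
    vec = TfidfVectorizer(stop_words="english")
    scores = np.asarray(vec.fit_transform(texts).sum(axis=0)).ravel()
    return np.array(vec.get_feature_names_out())[np.argsort(scores)[::-1][:k]]

def top_topic_terms(texts, n_topics=100, k=5):
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    top_topic = lda.transform(counts).sum(axis=0).argmax()   # most prevalent topic
    order = np.argsort(lda.components_[top_topic])[::-1][:k]
    return np.array(vec.get_feature_names_out())[order]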

RQ4.2: How often do top terms and topics occur in each set of releases? We count the releases in each set whose text contains the top terms that emerge from TF.IDF and topic modelling in RQ4.1. We apply a bag-of-words model, which ignores the ordering of the words in each document. This eliminates the need to check for multiple forms of text that discuss the same topic.
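One possible reading of this counting step is sketched below; the exact matching rule (here, a release counts when its text contains both terms of a pair) is an assumption.

# Hedged sketch of RQ4.2: bag-of-words counting of releases mentioning a term pair.
def count_mentions(release_texts, term_pair=("bug", "fix")):
    hits = 0
    for text in release_texts:
        words = set(text.lower().split())          # ordering of words is ignored
        if all(term in words for term in term_pair):
            hits += 1
    return hits, len(release_texts)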

RQ4.3: What are the effects of each of the candidate causes? We select the sets of statistically impactful and non-impactful releases, as well as the sets of impactful releases that increased and decreased rating, and compare the distributions of several developer-controlled properties. This will help to establish potential causes that may have led to the releases being impactful, or positively affecting an important success metric for developers: the rating.

We consider properties of the release that lie within the control of the developer: (P)rice, (RTsize) the size of release text (defined in Section 7.2) in words, and (RTchange) the change in RTsize on the week of release. For each of these properties, we use the Wilcoxon test and Vargha and Delaney’s Â12 effect size comparison [273] to identify statistically significant differences between causally impactful releases and non-impactful releases. We use the untransformed Vargha and Delaney’s comparison [208] because we are only interested in the raw probability value it produces. In case a statistical difference is found for a given property, we use box plots to understand if this property may play a role in the changes observed (i.e. it may be a candidate cause).
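As a sketch of this comparison: the Wilcoxon rank-sum test is equivalent to the Mann-Whitney U test used below, and the Â12 computation is the standard rank-based definition; the exact implementation used in the thesis is not specified.

# Hedged sketch of RQ4.3: rank-sum p-value plus the untransformed Vargha-Delaney A12
# for two independent samples, e.g. prices of impactful vs non-impactful releases.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

def vargha_delaney_a12(x, y):
    ranks = rankdata(np.concatenate([x, y]))
    r1 = ranks[: len(x)].sum()
    # A12 = P(X > Y) + 0.5 * P(X == Y)
    return (r1 / len(x) - (len(x) + 1) / 2) / len(y)

def compare(x, y):
    _, p = mannwhitneyu(x, y, alternative="two-sided")
    return p, vargha_delaney_a12(x, y)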

At this point, we will have identified a subset of releases that exhibit statistically significant success effects, and we will have identified further differences between sets of impactful and non-impactful releases, as well as the differences between positive and negative impactful releases. To further explore the actionability potential of the tool and our findings, we ask the following research question.

RQ5: How useful is causal impact analysis to developers?

Because there is no ground truth, only different implications of any discovered statistical significance, we can only answer this question semi-quantitatively. To determine whether our results are useful we simply ask the developers of causally impactful releases, as determined by our tool, CIRA, whether they agree with the classification. We email developers via the email addresses contained on their app store pages, informing them of the impactful release and proposing to send a report detailing the tool’s findings. We expect a large proportion of these emails may fail to reach a human, and can only confirm contact is established if we hear back from a developer. Once contact is established with the app’s developers, we ask them the following questions2:

Agree with detected significance: We ask if the developers of the app agree with CIRA’s assessment that the release was impactful.

External cause of changes: We ask if the company is aware of an external event which may have caused the detected significant change, such as advertising cam- paigns.

Would change strategy: We ask if the company would consider changing their release strategy based on findings.

Receiving further reports: We ask if the company is interested in receiving further reports for their app releases.

Learning contributing factors: We ask if the company is interested in learning more about the characteristics of impactful releases, from the results of our study.

2The full questionnaire is included in Appendix E.

Figure 7.4: RQ1: Standard deviation box plots [185]. R: rating; N: number of ratings; NW: number of ratings in the last week; L: length of description in words. Full: complete set of apps and releases; Con: control set of non-releasing apps; Tar: target set of apps and releases.

7.4.4 Application of CIRA

CIRA is used to apply causal impact analysis in this study. The high number of apps in the control set resulted in apparently exponential growth in execution time when running CausalImpact, likely due to the selection of control vectors to use as spike regressors. By contrast, CIRA’s execution time grows only linearly, in O(n), with the size of the control set.

7.4.5 Results

This section answers the questions posed in Section 7.4.3.

RQ1: Do app metrics change over time? The box plots in Figure 7.4 show that the metrics (R)ating, (N)umber of ratings, (NW) number of ratings per week and (L)ength of description do change over time, because their median standard deviation is always positive. The deviation in number of ratings and number of ratings per week is high, but very low for rating. This is because, for apps with many ratings, even a small change corresponds to thousands of users rating higher or lower than the established mean.

The rating deviation is approximately even between the control and target datasets; this is a surprising result, and indicates that either a) ratings are unstable (because they change) even for stable, established, non-releasing apps, or b) app releases have little effect on overall, globally detectable ratings. The finding of a low deviation in the target set means that a causally significant release (that affects rating) has a good chance to ‘stand out from the crowd’.

Table 7.8: RQ2: Correlations between release statistics [185]. R: rating; N: number of ratings; ∆: change in metric from week 1 to week 52.

Release statistic    R      ∆R     N      ∆N
Quantity             0.13   0.09   0.27   0.30
Median interval     -0.12  -0.06  -0.19  -0.21

The box plots show that standard deviations in the number of ratings and rating frequency are significantly higher for the target releasing dataset than for all apps in the dataset and for the control set. This is expected, and shows some utility to app releases: to increase user activity in downloading and rating the apps, and perhaps to increase the user base. The deviation in description length is higher for the target dataset as expected, suggesting that descriptions are updated to provide information about releases. This finding supports the intuition that descriptions may be used as a channel of communication between developers and users.

Answer to RQ1: Do app metrics change over time? The metrics (N)umber of ratings and (NW) number of ratings per week show a high standard deviation for apps between February 2015 and February 2016, but (R)ating shows only a small deviation. The deviation in user rating frequency is higher for the target releasing dataset, suggesting that app releases lead to user activity.

RQ2: Do release statistics have a correlation with app performance? We now measure the correlations between the number and interval of releases and success metrics.

RQ2.1: Does the number of releases have a high correlation with app performance? Table 7.8 presents the results of correlation analysis between release quantity and median interval, and app metrics for the target dataset. We report correlation coefficients (rho values) that are deemed significant (p ≤ 0.05). The results in Table 7.8 indicate only weak significant correlations between the success metrics and their change over the time period studied. The strongest correlation, for ∆N where rho = 0.30, is still too weak to suggest a strong relationship. We therefore conclude that there is no strong overall correlation between release frequency and the app metrics we collect, but there is evidence for a weak correlation between the number of releases and the number of reviews accrued over a year.

RQ2.2: Does the median time interval between releases have a correlation with app performance? Table 7.8 presents the results of correlation analysis between release interval and app metrics. As these results reveal, there is little evidence for any strong correlation between the median inter-release time period and the app metrics we collect. These findings corroborate and extend the recent findings by McIlroy et al. [191], who reported that rating was unaffected by release frequency in the Google Play store. This is interesting, because there is evidence that app developers release more frequently when an app is performing poorly [55]; our results indicate that this perhaps rather desperate behaviour is unproductive.

Answer to RQ2: Do release statistics have a correlation with app per- formance? Neither higher numbers of releases nor shorter release intervals correlate strongly with changes in success.

RQ3: Do releases impact app performance? The results from RQ1 and RQ2 have established that app rating metrics do vary over time, but that the number of releases and the time intervals between releases are not important factors in determining these changes. This makes causal impact analysis potentially attractive to developers. With it, developers can seek to identify the set of specific releases that had a greater effect on success, using evidence for significant changes in post-release success compared with the set of non-releasing apps. This is the analysis to which we now turn in RQ3.

RQ3.1: What proportion of releases impact app performance? Table 7.9 presents overall summary statistics for the results of causal impact analysis. The ‘Apps’ row indicates the number of apps summarised as part of the target dataset, and the ‘Target releases’ row indicates the number of releases these apps underwent in the studied time period that were outside of a 3-week window of other releases.

Table 7.9: RQ3.1: Causal impact analysis results [185]. R: rating; N: number of ratings; NW: number of ratings per week; +ve: releases that increased rating; -ve: releases that decreased rating.

Details of target releases (percentages reported over 26,339 target releases):
Apps              14,592     Non-impactful    17,639 (67.0%)
Target releases   26,339     Impactful         8,700 (33.0%)
Control apps         680

Details of impactful releases (percentages reported over 8,700 impactful releases):
Metric     Total (of impactful)   +ve              -ve
R          4,781 (55.0%)          2,563 (29.5%)    2,218 (25.5%)
N          4,747 (54.6%)          4,747 (54.6%)    0
NW         2,226 (25.6%)            862 (9.9%)     1,364 (15.7%)
R ∩ NW       701 (8.1%)             220 (2.5%)       199 (2.3%)

Those releases that occur near the beginning or end of the time period will therefore not have sufficient information available, and so causal impact analysis cannot be applied. Hence we select a subset of releases that must also belong in the range of weeks [4, 49], out of a possible [1, 52]. Of these target releases, some are impactful and some are not according to causal impact analysis. As Table 7.9 reveals, we found that 33.0% of releases were impactful.

The remainder of Table 7.9 reports the observed change in success metrics for impactful releases, thereby identifying candidate causes of these effects. For each success metric change, we report the total number of releases that exhibited a significant change in the associated metric and the percentage (of all app releases) that exhibited the change. We further subdivide this total into those that are considered ‘positive’ and ‘negative’ (from a developer’s perspective).

From the 33.0% of impactful app releases in Google Play, approximately a third (30.0%) affected more than one success metric. The releases of most potential interest to developers are likely to be those that affect both rating and rating frequency (due to their potential for increasing user base and revenue). Of these, there were 701 impactful releases.

Table 7.10: RQ3.2: Agreement between full control set and different set sizes [185]. R: rating; N: number of ratings; NW: number of ratings per week.

                      Control set size
Metric          3     5     10    21    42    85    170   340   680
Agreement
  R             0.93  0.93  0.93  0.91  0.91  0.91  0.93  0.93  0.94
  N             0.91  0.91  0.91  0.91  0.91  0.90  0.91  0.91  0.92
  NW            0.97  0.97  0.97  0.97  0.97  0.95  0.97  0.97  0.98
  All           0.94  0.94  0.94  0.93  0.93  0.92  0.93  0.94  0.94
Cohen's Kappa
  R             0.78  0.76  0.76  0.72  0.72  0.73  0.76  0.77  0.79
  N             0.69  0.70  0.70  0.70  0.69  0.66  0.69  0.69  0.73
  NW            0.78  0.78  0.80  0.83  0.81  0.71  0.81  0.83  0.85
  All           0.75  0.74  0.75  0.74  0.73  0.70  0.74  0.76  0.78

These results support the hypothesis that there is a subset of releases that cause significant changes to their app’s success in the store.

RQ3.2: How does the causal control set size affect results? Table 7.10 reports the effect of choosing different control set sizes, from among those apps that did not undergo any release during the time period studied. The results in Table 7.10 reveal strong agreement for each control set size, indicating that the model is stable in this case (using non-releasing apps). One can see from the two runs using the full 680 apps in the control set that there is between 0.92 and 0.98 agreement, due to the stochastic element of causal impact analysis. Our results show that restricting the choice of control set does not have advantages, and so we opt to use the full 680 apps as the control set for subsequent experiments.

Answer to RQ3: Do releases impact app performance? There is strong evidence (p ≤ 0.01) that 33% of the target releases in the Google Play store significantly affected a success metric, and approximately 11% significantly affected more than one success metric.

Table 7.11: RQ4.1: Top release text terms [185]. TF.IDF terms on the left and Topic Modelling topics on the right. R: rating; N: number of ratings; NW: number of ratings per week. Apps: 14,592; target releases: 26,339; releases with release text: 20,014.

Table 7.12: RQ4.2: Occurrences of the terms ‘bug’ and ‘fix’ (overall: 4,690 / 13,200, 35.5%) and ‘new’ and ‘feature’ (overall: 2,473 / 13,200, 18.7%) in release text [185].

RQ4: What characterises impactful releases? The finding from RQ3 tells us that there are impactful releases, but it cannot identify the causes, merely that there has been a significant change in post-release success. We now turn to identify any candidate causes, which may have played a significant role in the changes observed, and analyse their effects.

RQ4.1: What are the most prevalent terms in releases? Table 7.11 reports the results of information retrieval, using TF.IDF and topic modelling on the release text of impactful releases. In this table, we consider only those apps for which release text is available. The results show only the top TF.IDF terms, and terms from the top topic, respectively, thereby indicating the most prevalent terms and topic.

Of the 26,339 target releases (those with sufficient evidence for causal impact analysis), 20,014 have release text available. The remainder of the table consists of four overall columns, giving the metric for which a release is found to be impactful (leftmost column), followed by the most prevalent terms (for TF.IDF) and topics (for topic modelling), followed by a subdivision of these prevalent terms into those whose effects are positive and negative (from a developer’s perspective).

Table 7.11 reveals that terms and topics themed around bug fixes and features occur frequently overall in the text of impactful releases. The text of releases that positively or negatively affected rating is slightly more specific to certain sets of apps, i.e. “card map feature”, “wallpaper live christmas”, yet still appear to refer to features. These observations motivate our subsequent analysis in RQ4.2 for the terms ‘bug’ and ‘fix’ and ‘new’ and ‘feature’.

RQ4.2: How often do top terms and topics occur in each set of releases? Table 7.12 shows the number of occurrences, within impactful and non-impactful releases, of the terms ‘bug’ and ‘fix’, and ‘new’ and ‘feature’, which emerged as the top terms from our information retrieval in RQ4.1.

The results in Table 7.12 show that the terms ‘bug’ and ‘fix’ are far more common than ‘new’ and ‘feature’ in both impactful and non-impactful releases. This is unsurprising because bug fixing occupies a large proportion of development effort [304]. However, it is noteworthy that, in both cases, the terms are more common in the releases that positively affected metrics, as opposed to those that negatively affected metrics.

Table 7.13: RQ4.3: Probabilistic analysis of candidate contributions [185]. P: price; RTsize: size of release text in words; RTchange: change in RTsize on the week of release; +ve R: releases that increased rating; -ve R: releases that decreased rating.

            Impactful / Non-impactful     +ve R / -ve R
Metric      Wilcoxon    Â12               Wilcoxon    Â12
P           0.000       0.539             0.000       0.419
RTsize      0.000       0.532             0.006       0.479
RTchange    0.110       0.505             0.434       0.499

RQ4.3: What are the effects of each of the candidate causes? Based on the low p-values reported in Table 7.13, we further analyse price and the size of the release text as candidate causes. Figure 7.5 presents box plots showing the distribution of (paid app) price and release text size, comparing impactful releases against non-impactful releases, and releases that positively affect rating against those that negatively affect rating. Since 61% of the apps are free, we plot paid apps in the price boxplot, and compute the proportion of free and paid apps in each set, as well as the mean and median prices. The results for price are interesting, somewhat surprising, and nuanced. We found that impactful releases (positive and negative) have higher prices than non-impactful releases. The mean price of all impactful releases (including free) was £0.79, and the mean price of all non-impactful releases was £0.59. Paid releases were more likely to be impactful: 40.2% of paid app releases were impactful, compared with 30.7% of free app releases. A somewhat surprising finding is that higher priced (paid) app releases are more likely to have a positive effect: a mean of £3.25 and median of £1.72, compared with £2.45 and £1.58 respectively for those with negative effects. However, a greater proportion of impactful paid app releases negatively affected rating: 62.6% compared with 41.0% for free apps. Overall, a larger proportion of impactful releases are paid than non-impactful releases (29.6% compared with 21.7%, respectively). As a result, the mean price of impactful releases is higher by £0.20. Of course, a price difference of £0.20 in Google Play appears relatively trivial. However, the difference in revenue that accrues can be substantial.

Figure 7.5: RQ4.3: Box plots of Price and Release Text size [185]. S: impactful releases; NS: non-impactful releases; +veR: releases that increased rating; -veR: releases that decreased rating.

We conservatively calculate that for the app OfficeSuite Pro + PDF (which had a release on 23rd December 2015, for which we observed a significant effect on the subsequent rating), over its (minimum) 50,000 installs [94], the price difference of £0.20 extrapolates to a (minimum) £10,000 in accrued revenue. Our revenue calculation is particularly conservative, since at the time of this release, OfficeSuite Pro was priced at £11.66. This finding suggests that developers need not fear a ‘race to the bottom’ with competitors over pricing. Unsurprisingly, these results confirm the intuition that users can be expected to be price-sensitive.
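Spelling out the extrapolation above as a worked figure (taking the minimum install count at face value):

\[
  50{,}000 \ \text{installs} \times \pounds 0.20 \approx \pounds 10{,}000
\]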

The results in Table 7.13 also show that there are significant differences between the distributions of price and release text size, comparing impactful with non-impactful releases, and releases that increased rating with those that decreased rating. The median number of changed, stopword-filtered words in impactful release text is 11, compared to only 9 for non-impactful release text. This provides evidence that users may be influenced by release text, but the effect size is relatively small. Nevertheless, developers might wish to spend time carefully choosing their release text to maximise positive influences on the users.

Answer to RQ4: What characterises impactful releases? There is evidence that releases that significantly affect subsequent app success have higher prices and more descriptive release text. Releases that positively affect rating are more common in free apps, and in paid apps with high prices. We also note that the release text of releases that positively affect success makes more prevalent mentions of bug fixes and new features.

Table 7.14: RQ5: Developer responses to questionnaire [185].

Question                            Yes    No    Total
Agree with detected significance     35    11    46
External cause of changes            20    22    42
Receiving further reports            27    14    41
Learning contributing factors        37     4    41
Would change strategy                17    23    40

RQ5: How useful is causal impact analysis to developers? We sent 4,302 emails to the email addresses available in the Google Play app store pages of companies with impactful releases, as detected by our tool, CIRA. Of course, this is something of a ‘cold call’, and we suspect that many emails never even reached a human. We can report that 90 immediately failed due to invalid email addresses used on app store pages, and 127 were immediately assigned to a support ticket. Those that received a response are the only cases in which we can verify that contact was established with developers, of which there were 138. Of these 138 developers, 52 (distributed across 25 of 41 app store categories) replied to express their opinion in response to follow up questions. All respondents’ apps were established in the store; the smallest had 31 reviews and the largest had 78,948 at the time of our experiments. We summarise developers’ opinions in Table 7.14, in each case indicating the instances where developers expressed an opinion: the most answered question concerned agreement with CIRA’s assessment, for which there were 46 respondents. We observe that 35 out of 46 development teams who expressed an opinion agreed with CIRA’s assessment that their app release was causally impactful. For example, the developers of a dictionary application said: “in our case it was obvious for ourselves because it was a totally new release and with lots of new features”, resonating with our earlier finding that new features increase the chance for impactful releases (see RQ4.3). Only 11 developers disagreed with CIRA’s assessment. However, some of them were still able to identify a cause for the significant change detected by CIRA (but did not think that the release itself could be the cause). For example, the developers of a security caller app said: “We did not release anything. We just upload builds for our beta version which uses few users :-) So your tools are WRONG.” In this case, although the developers did not consider a beta version as a full release, the app’s version identifier changed on its app store page, resulting in a ‘release’ by our definition in Section 6.1. This detected release, combined with increased user activity, resulted in the causal significance detected by CIRA.

About one half of those who expressed an opinion (20 out of 42) indicated that they knew of an external reason for the changes, and several teams elaborated further. One might expect the developers who know of an external cause to be a subset of those who believe there to be a significant causal effect. However, 5 of the 20 developers who claimed to know of external causes also disagreed with CIRA that the corresponding release was impactful. One set of developers described the release as: “a minor ‘bugfix-only’ release. Therefore I doubt that this release was the main reason for this change.” However, as shown in the RQ4 results, mentions of bug fixes in release text are more prevalent in impactful releases that positively affected rating. Indeed, this particular release did mention bug fixes in its release text, and significantly and positively affected rating.

Over half of the developers who expressed an opinion (27 of 41) were interested in receiving further reports. The majority of developers (37 of 41) indicated that they would like to learn more about the characteristics of impactful releases, and 17 of 40 indicated that they would consider changing their release strategy based on our findings. Only 10 of the 17 developers who would change their strategy knew of an external cause of the changes, suggesting that this consideration will be based on our general findings. These results provide initial evidence that causal impact analysis can be useful to app developers.

Answer to RQ5: How useful is causal impact analysis to developers? Three quarters of developers who expressed an opinion (35 of 46) agreed with CIRA. Most developers (37 of 41) were keen to learn more about the characteristics of impactful releases, and 17 of 40 said that they would consider changing their release strategy based on CIRA’s findings. This provides initial evidence that causal impact analysis is useful to developers.

7.4.6 Threats to Validity

Internal validity: Like the study in Section 7.3, our dataset is subject to the App Sampling Problem [183]. We restrict claims about findings to those that apply specifically to the popular Google Play apps, and thereby do not suffer from the App Sampling Problem in our findings.

External validity: Care is required when extending these findings to other app samples and app stores. Nevertheless, the methods used to analyse causal effects can be applied to other app stores and datasets. Conclusions about the characteristics of impactful releases over this sample will yield interesting and actionable findings for developers, and we contact app developers for their opinions on the usefulness of our technique in RQ5. We can only report the views of those developers with whom contact could be established (see Section 7.4.5), and care is required when interpreting their responses. Since we were only able to reach 52 developers, we cannot claim that our sample is representative of the entire population. However, this is still a fairly large sample with respect to those used in other studies involving app developers (e.g., [166, 206]).

Construct validity: Our construct validity may be affected by our study of empirical app metrics and release text topics, as properties of impactful releases. To mitigate this threat, we ask developers if they know of external causes of the observed impacts in RQ5. As with Section 7.3, it may be that developers update an app without updating its documentation, and thus feature changes are not detected in the release text of the app. This threat is mitigated by comparing only apps with changed text for experiments comparing changed features.

Conclusion validity: The qualitative assessment of ‘top terms’ and topics in RQ4.1 could affect the conclusion validity of this study. This is mitigated in RQ4.2 by asking a quantitative question of the number of times ‘bug’ and ‘fix’ and ‘new’ and ‘feature’ occur in each release set.

7.4.7 Conclusions

In this work we propose the use of causal impact analysis to identify causally significant releases in app stores. In particular, we introduce our tool CIRA to perform causal impact analysis on 26,339 Google Play app releases between February 2015 and February 2016, for all apps that appeared at least once in the top rated apps in the year prior to February 2015. For these apps, we found that overall release frequency is not correlated with subsequent app success, but that there is evidence that price and release text size and content all play a role in whether a release is impactful and the type of effect it has on success. Higher priced releases are more likely to be impactful and, perhaps surprisingly, to positively affect rating; there were more prevalent mentions of new features and bug fixes in releases that positively affected rating; and impactful releases also had longer (likely more descriptive) release text. We have shown that causal analysis can be a useful tool for app developers by eliciting their opinions: most of those who responded were interested in the findings and agreed with CIRA’s assessment, and some said they would consider changing their releasing strategy based on our findings.

7.5 What are the Features that are Added to Impactful Releases?

Knowing the releases that are significantly impactful is only part of the story for developers; they want to know the features that can be added to their apps to achieve impact. Using topic modelling, we identify the features that have changed in app release text at the time of releases, comparing changes between impactful releases and non-impactful releases. We identify the features that are added in impactful releases and not added in non-impactful releases, and therefore might serve as candidate features to implement in order to stimulate user download or rating activity. Conversely, we also identify the features added in non-impactful releases that are not added in impactful releases, and so may not lead to impact. We conduct this comparison between positive and negative impactful releases, in order to identify the features that developers could add to their apps in order to stimulate a positive user reaction. Using our identified set of 6,776 changed, impactful releases, of which 3,750 are positive and 3,026 are negative, and our set of 17,639 changed, non-impactful releases, we identify the features that developers need to know about, both overall and at category granularity.

7.5.1 Introduction

As is well known, correlation does not necessarily imply causation. Therefore, when an app happens to have a very impactful release that also adds a new feature, it does not necessarily follow that the feature has caused this impact. A multitude of other factors, many outside of the developers’ control, may have contributed to this impact beyond the feature itself. Consider, however, a feature that is very frequently observed to have been included in app releases that were very impactful: the resourceful developer will begin to wonder if they should include this illustrious feature in their next release too. This is because correlation provides evidence that suggests a relationship, even if it cannot prove it.

In this study, we seek to find the set of feature changes that are most prevalent amongst the most impactful app releases studied in Section 7.4. We analyse the changes in features in app release text at the time of release, using the topic modelling technique to identify topics that are analogous to features (explained in Chapter 4). We break down the difference in proportion of topic changes between impactful and non-impactful releases, and between positive and negative releases, in order to identify the topic changes that differ most between the sets. The results of this study provide evidence that suggests a relationship between topics and release impact.

7.5.2 Contributions

The contributions of this study are as follows:
i) We have identified a set of features added to impactful releases that are not added to non-impactful releases, and vice versa: such as “start stop data gps directly” and “run smoother make faster content”.
ii) We have identified a set of features added to positive releases that are not added to negative releases, and vice versa: such as “better performance faster optimisation enhancement” and “sound thanks rate quality really”.
iii) We have identified specific changed features at a category granularity, providing meaningful results for developers in each specific category.

7.5.3 Data

We use the dataset mined from Google Play, that consists of 14,592 apps and 26,339 releases, as described in Section 7.4.2.

7.5.4 Research Questions

The following questions are answered using the topic modelling method first described in Chapter 4, which treats topics as analogous to features.

RQ1: What are the features that are added or removed from releases that are impactful, that differ from releases that are non-impactful?

We find the features that are most commonly added to impactful app releases, but are not added to (or are removed from) non-impactful releases. Additionally, we find the features that are most commonly removed from impactful app releases, but may be added to non-impactful releases.

RQ2: What are the features that are added or removed from releases that are positive, that differ from releases that are negative?

We find the features that are most commonly added to impactful app releases that positively affected their app’s performance, but are not added to (or are removed from) negative releases. Additionally, we find the features that are most commonly removed from positive app releases, but may be added to negative releases.

7.5.5 Methodology

Process: We use the following process to identify how different features appear in changed release text.

1. Separate impactful from non-impactful releases: We separate the releases that had a significant change in post-release user rating performance, as identified by CIRA in Section 7.4.

2. Separate positive from negative impactful releases: We further separate the impactful releases by analysing the subsequent change as a positive change or a negative change, i.e. an increase or decrease in rating or rating frequency.

3. Filter app releases for those with changed release text: In order that we can analyse and compare the changes which may contribute to impactful releases, we consider only those releases with changed text. It may seem surprising that some releases, that have significantly impacted the app’s performance, did not change their release text in any way (yet did update the app). However, this reinforces the notion that external factors also contribute to an app’s performance. This filtering process may remove apps that do not have any releases with changed release text.

4. Compute the differences in changes: We consider the topics where one set reduces the count and the other increases it, or otherwise the topics where the impactful count is at least twice that of the non-impactful count. For each set, we compute the proportion of releases in which the topic changed, and rank the results according to the magnitude of the difference between sets.

Topic Modelling: We use topic modelling (described in Section 3.6) to compute the features present in the release text, using 1000 topics in order to produce a high granularity of specialised topics. We use a topic threshold value of 0.05 in the reported results, which is sufficiently permissive to enable changes of 2-3 terms to register as an addition or removal of a topic.
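A hedged sketch of this change computation is given below, assuming per-release topic distributions (topic id to probability) for the release text before and after the release; the data structures and function names are illustrative, not the thesis implementation. Differences between sets (e.g. impactful vs non-impactful) are then taken over these proportions, as in step 4 above.

# Hedged sketch of the Section 7.5.5 process: a topic counts as present when its
# probability meets the 0.05 threshold; additions and removals are set differences.
THRESHOLD = 0.05

def topic_changes(before: dict, after: dict, threshold: float = THRESHOLD):
    old = {t for t, p in before.items() if p >= threshold}
    new = {t for t, p in after.items() if p >= threshold}
    return new - old, old - new          # (topics added, topics removed)

def change_proportions(releases, threshold: float = THRESHOLD):
    # `releases` is a list of (before, after) topic-distribution pairs for one set
    # (e.g. impactful releases); returns per-topic added/removed proportions.
    added, removed, n = {}, {}, len(releases)
    for before, after in releases:
        add, rem = topic_changes(before, after, threshold)
        for t in add:
            added[t] = added.get(t, 0) + 1
        for t in rem:
            removed[t] = removed.get(t, 0) + 1
    return ({t: c / n for t, c in added.items()},
            {t: c / n for t, c in removed.items()})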

Dataset: We use the persistent Google dataset that is also used in Section 7.4. This dataset consists of 14,592 apps and 26,339 releases.

Table 7.15: RQ1 + RQ2: Topics that changed at release time. NI: non-impactful releases; I: impactful releases; N: releases that decreased rating; P: releases that increased rating. Top: proportion of topics that were added in I/P releases, and taken away or not added in NI/N releases. Bottom: proportion of topics that were removed or not changed in I/P releases, but were added in NI/N releases.

NI      I      Topic                                              N       P      Topic
-0.01   0.30   start stop data gps directly                      -1.19   -0.05   better performance faster optimisation enhancement
-0.03   0.12   reward arena class skill season                   -0.93   -0.05   facebook follow like twitter news
-0.02   0.12   flash add widget version language                 -0.93   -0.11   wallpaper live muzei beautiful picker
-0.01   0.12   completely redesigned design reworked companion   -0.40    0.16   map location gps data zoom
-0.03   0.09   expense report transaction income bill            -0.46    0.00   sticker testing line bot chat
-0.01   0.10   updated library internal party component          -0.43    0.00   android app time improved present
 0.06  -0.06   power level ups unlock time                        0.30    0.03   permission read used storage external
 0.16   0.04   run smoother make faster content                   0.13   -0.16   notification push receive alert receiving
 0.05  -0.07   challenge reward win daily earn                    0.53    0.00   trip travel destination time guide
 0.03  -0.10   orientation landscape portrait tablet mode         0.59    0.05   start stop data gps directly
 0.04  -0.21   sticker testing line bot chat                      0.66    0.05   app inside update mean just
 0.13  -0.47   wallpaper live muzei beautiful picker              0.96    0.13   sound thanks rate quality really

7.5.6 Results

We run the algorithm described in Section 7.5.5 to compute the list of features that differ between impactful releases with changed text, and non-impactful releases with changed text. Table 7.15 reports the results for RQ1 and RQ2 for topics. Several of the top features overall are specific to certain categories of apps, and thus are not discussed in detail below. Additional category-specific results can be found in Appendix C, which may be of interest to developers in each of these specific categories.

RQ1: What are the features that are added or removed from releases that are impactful, that differ from releases that are non-impactful?

The top topic that is added to releases that become impactful is “start stop data gps directly”, which seems intuitive in its application as a feature, and does seem like a useful one for apps that use data. The results in Table 7.15 show that terms from “reward arena class skill season” are added to releases that become impactful, and not added to any releases that are non-impactful. This, however, is ambiguous in its interpretation as a feature. The second topic, “puzzle pack update piece daily”, is clearer in its interpretation as a feature. Where applicable, developers might add daily packs of pieces or puzzles to their apps.

The topic “expense report transaction income bill” seems specific to utility management applications, but could potentially be applied more widely to apps that handle or contain assets of value. Several topics are specific to game apps, such as “tournament player online win ranking” and “race track mode car win”. However, the topic “completely redesigned design reworked companion” is very general, but would require a lot of development effort to implement. Another more general topic is “mode ambient color weather hand”, which could apply and be added to many apps.

RQ2: What are the features that are added or removed from releases that are positive, that differ from releases that are negative?

The results in Table 7.15 show that the top impactful topic, “start stop data gps directly”, correlates with negative releases and not positive releases, thus indicating it is likely not a good feature to add to apps. Similarly, the topic “sound thanks rate quality really”, which appears to elicit user feedback, correlates with negative releases. It may also lead to negative impacts to add the feature “notification push receive alert receiving”, which seems intuitively to relate to push notifications, as this feature is removed from positive releases and added to negative releases.

Conversely, the topic “facebook follow like twitter news” is removed (or downplayed such that it is no longer a significant proportion of release text) from a much higher proportion of negative releases than positive. This result suggests that Facebook and Twitter integration is not a negative feature to have, and that removing it could lead to a negative impact on success. Similarly, “better performance faster optimisation enhancement” was removed from a higher proportion of negative releases than positive, perhaps indicating that it was merely downplayed in release text: no developer would deliberately ‘remove better performance’. A higher proportion of positive releases added “map location gps data zoom”, and a proportion of negative releases removed it, thus indicating that map and gps integration may be a desirable feature to have.

7.5.7 Threats to Validity

In this section we discuss threats to the construct, conclusion and external validity of our findings.

Construct Validity: The same threats to validity apply as for the studies detailed in Section 7.3 and Section 7.4. We build on the results of Section 7.4 in order to identify the features discussed as changed in the release text of releases.

Conclusion Validity: The conclusion validity of this study could be affected by the qualitative human assessment of features and topics in RQ1 and RQ2.

External Validity: Care is required when extending these findings to other app samples and app stores. We report the proportion of each dataset that added or removed a topic, and in each case use the maximum set of viable releases, thus mitigating a potential threat to validity in the representation of each dataset.

7.5.8 Conclusions

We have found that certain features are added exclusively to apps in impactful releases, and that other features are added exclusively in non-impactful releases. By identifying these features, developers can gain insights into things they can implement in order to achieve high impact in their releases, and features that might not result in a high subsequent impact. The feature that most correlates with positive impact is “map location gps data zoom”, suggesting that map and gps integration may be a feature to add to achieve high positive impact, where applicable. The feature that most correlates with high (negative) impact is “start stop data gps directly”, indicating that apps with data usage might implement a direct control over their usage, in order to control bandwidth and potentially costly data usage. We identified several other features that may apply to a variety of different app types, and that developers might try adding to their apps.

7.5.9 Threats to Validity

The threats discussed in this section apply to the studies undertaken and detailed in Sections 7.3 to 7.5.

Construct Validity: The gap between data analysis and causality is large, forcing any user to make very strong assumptions if they hope to effectively imply causality.

Causal analysis can help to reduce this gap, but no such analysis could hope to fully close it; there will always be unknown factors which may nevertheless have affected the data. In the case of app stores, there will always be potential external influences for which no data is available to capture them. We apply causal impact analysis in our experiments, but there are other forms of causal analysis, such as differences-in-differences. Nonetheless, we believe this method is the most suitable due to its independent consideration of each app release, and the ability to use all non-releasing apps as the control set in every experiment, thus reducing the risk of control set choice influencing results. Since we are computing multiple p values, the reader might expect some kind of correction, such as a Bonferroni or Benjamini-Hochberg [24] correction for multiple statistical testing at the (traditionally popular) 0.05 probability level (corresponding to the 95% confidence interval). However, we are not using p values to test for significance: should a p value lie above this (corrected) threshold, this does not necessarily indicate that the property does not contribute to the observed causal impact. Quite the contrary; since we have already observed that there exists a causal impact, the property that exhibits the lowest p value remains that with the highest probability of having some influence on the causal impact, amongst those properties assessed using inferential statistics.

7.6 Related Work

In this section, we discuss previous work on software releases and causal analysis in software engineering. There has been a large amount of recent work linking software quality with user-perceived quality. Ruiz et al. [238] studied how ad library usage affected user ratings. Bavota et al. [22] investigated how the changes and faults present in the APIs used affected apps’ ratings. Panichella et al. [218] classified user reviews for software maintenance. Palomba et al. [215] studied how developers responding to user feedback can increase the rating. Moran et al. [200] presented an Android bug reporting tool that can increase user engagement with software quality.

It therefore stands to reason that software releases affect quality and consequently may affect user rating performance. In 2011 Henze and Boll [113] analysed release times and user activity in the Apple App Store, and found that Sunday evening is the best time for deploying games. In 2013 Datta and Kajanan [67] found that apps receive more reviews after deploying updates on Thursday or late in the week. In 2015 Gui et al. [101] found that 23% of releases from frequently-releasing apps contained ad-related changes. Comino et al. [55] studied the top apps in the Apple and Google stores, finding that releases can boost user downloads.

McIlroy et al. [191] studied update frequencies in the Google Play store, finding that only 1% of studied apps received more than one update per week. These findings support our weekly data collection schedule, as very few releases can be ‘missed’ by collecting data weekly; additionally, the target releases we use (defined in Section 7.4.2) mandate that very frequently updated apps are excluded, due to lack of sufficient prior and posterior time series data. McIlroy et al. [191] also found that rating was not affected by update frequency; however, the findings by Guerrouj et al. [100] indicate that high code churn in releases correlates with lower ratings. Nayebi et al. [205] surveyed developers and users, finding that half of the developers had clear releasing strategies, and many experienced developers thought that releasing strategy affects user feedback. Users were not more likely to install apps based on release date or frequency, but preferred to install apps that have been infrequently, but recently, updated.

All of these previous findings on app releases tantalisingly point to the possibility that certain releases may have higher causal significance than others. However, no previous study has specifically addressed the question of how to identify the set of releases that are significant. Furthermore, no previous work has attempted to identify the characteristics of highly significant app releases: the primary technical and scientific contributions of the present study.

To the best of our knowledge, this is the first study to apply causal impact analysis to app store analysis. However, causal inference has been discussed in empirical science papers [147] and it has previously been used in software engineering: for example, the work using the Granger causality test by Couto et al. [60, 61] and Zheng et al. [306] as a means of software defect prediction, and the work by Ceccarelli et al. [42] on identifying software artefacts affected by a change. Since CIRA enables the application of causal impact analysis to any time series vector, future work will use it to analyse other metrics from app store data, and other time series datasets.

Chapter 8

Conclusions and Future Work

App store analysis is a new research field that concerns mining empirical data about apps from app stores. From 2012 to 2016 the field has continued to grow and expand in the subfields of research that it encompasses, as discussed by the survey in Chapter 2. This thesis presents novel contributions to the field of app store analysis, helping to define the field, identifying associated issues that researchers should be aware of, and helping to facilitate future research through the inclusion of a checklist of data inclusion guidelines. These guidelines are adhered to where relevant in this thesis, and include data on the app stores used, the quantity and popularity range of apps, reviews, and any empirical metrics used. It is our hope that future app store analysis work will bear this checklist in mind when publishing, and continue to build on it, increasing the information given about the research carried out in order to support future synthesis in surveys.

Our survey identified the key sub-fields of app store analysis to date: API analysis, feature analysis, release engineering, review analysis, security analysis, store ecosystem comparison, and size and effort prediction. A key outcome of this study was the definition of app and app store in the context of app store analysis research, which may help to determine which research is part of the field and which is not. This helps to facilitate future literature surveys, and to point authors to the relevant literature in their particular subfield, in order to help prevent unintended replication and to direct research towards new, novel ventures.

The survey included as part of this thesis is not the end point for app store analysis; rather, it is the starting point of a well defined body of literature that will encourage future app store analysis authors and surveyors to advance the field. We identify areas in which the field can grow, for example the extension to smaller stores such as the Windows Phone Store, and an increase in the use of time series data; but our analysis is by no means complete, and we encourage readers to identify potential research options we missed. We also encourage future surveyors to ask research questions that can lead to more specific SLRs, now that an initial body of literature has been identified.

This thesis tries to answer the question of “what makes successful apps successful?”, which it tackles using several different approaches. This was aimed not only at researchers in the field, but also at the developers of apps, hoping to increase knowledge and awareness from their perspective. The first approach was to develop a method for feature extraction from textual descriptions that uses topic modelling to train topics, in a way analogous to feature extraction. The goal was to uncouple feature extraction from individual stores, as the n-gram method was coupled to them. The approach was validated by replicating results previously obtained using the n-gram feature extraction algorithm.
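
To make this concrete, the sketch below illustrates topic modelling over app descriptions in the spirit of the feature extraction approach described above. It is a minimal illustration under stated assumptions: the three descriptions are invented, scikit-learn’s LDA is used rather than the MALLET-based pipeline employed in the thesis, and the parameter choices are arbitrary.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Invented app descriptions standing in for real store data.
    descriptions = [
        "track your runs with gps map location and zoom",
        "share photos with friends follow like and comment",
        "manage data usage start stop background sync directly",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(descriptions)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)       # per-app topic proportions, usable as features

    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:5]]
        print(f"topic {k}: {' '.join(top)}")     # topics read like 'map location gps data zoom'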

Our topic extraction method was useful in performing the studies in Chapters 5 and 7, but it is by no means the only solution that could have been applied. We encourage researchers to develop alternative methods for feature extraction, and to reuse the alternatives discussed in Chapter 2. We also encourage comparison of results between different methods and sources of feature extraction. This is particularly relevant to the studies in Chapter 7, where we may miss many potential properties of impactful releases.

After mining app store data to tackle the problem of what makes apps successful, we encountered the problem that only a subset of app store data is available, starting with the most popular apps. Further inspection showed that this is a methodological problem afflicting app store analysis research more widely. We dubbed this the “app sampling problem”, and investigated its effects by comparing metrics and trends between sets of varying completeness of review data. Our investigation was limited, therefore, to our defined sets as ranked by review data, yet the problem applies more widely. Overall, this study was aimed at raising awareness of the problem, whilst identifying properties and trends that differ between sets of varying completeness, and trends that are consistent. We encourage researchers to take the app sampling problem into account when performing studies, and to extend our work to investigate how results are affected.

We next tackled the question of “what makes successful apps successful?” using a dataset accumulated over time, which featured app releases from Google Play and Windows Phone Store. Causal inference provides a method with which to assess the subsequent impact of releases on performance, and causal impact analysis is particularly suited to app store data sets due to its independent consideration of each release. For this reason we used causal impact analysis in our experiments, but other causal inference methods are available, such as differences-in-differences analysis. We encourage future researchers to explore the use of these tools for app store analysis, and to develop new approaches for the identification of events or properties of significance.

The Google tool CausalImpact was used for our initial study into the releases that had a significant subsequent impact on app performance (“impactful” releases), and enabled the identification of properties, such as price and release text, of impactful releases from the Google and Windows stores. For the extension to large app store datasets, such as the dataset of 36,000 Google Play apps recorded weekly over a period of 12 months, it was necessary to implement CIRA, a Java implementation of causal impact analysis which uses alternative methods to process and assign the global variance component: the control set.
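
As a rough illustration of the counterfactual reasoning that underlies this kind of analysis, the sketch below fits an ordinary least squares model to a treated series on synthetic pre-release data, using a set of control series as predictors, and then compares the post-release observations with the predicted ‘no release’ series. It is not CIRA’s implementation, which uses a Bayesian structural time-series model; all of the series, and the injected effect, are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    weeks, release_week = 52, 40

    # Synthetic weekly performance series: 30 non-releasing control apps and one treated app.
    controls = rng.normal(4.0, 0.1, size=(weeks, 30)).cumsum(axis=0) / 10
    treated = controls.mean(axis=1) + rng.normal(0.0, 0.02, weeks)
    treated[release_week:] += 0.3                      # injected post-release effect

    X = np.column_stack([np.ones(weeks), controls])    # intercept plus the control set
    pre, post = slice(0, release_week), slice(release_week, weeks)

    beta, *_ = np.linalg.lstsq(X[pre], treated[pre], rcond=None)   # fit on pre-release data only
    counterfactual = X @ beta                                      # predicted series with no release

    effect = treated[post] - counterfactual[post]
    print(f"estimated average post-release effect: {effect.mean():.3f}")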

CIRA’s implementation enables the processing of large control sets, and made the analysis of our dataset of 36,000 apps possible. This subsequent study confirmed that the previous findings held true, such as the finding that higher priced (paid) app releases are more likely to have a positive effect, and also enabled access to a larger selection of app developers. As part of the study, we contacted 52 developers of apps with impactful releases, regardless of whether their particular impact was positive or negative, in order to ask a number of questions regarding the results of the study. 78% of developers agreed with the observed impacts, of which 33% said that they would consider changing their releasing strategy as a result of the findings.

As an extension to this work, we investigated the features that correlate with highly impactful releases, and those that correlate with highly impactful positive releases. The findings show several general features that correlate with positive impactful releases, such as “start stop data gps directly”, and others that correlate with negative impactful releases, such as “facebook follow like twitter news”. Many of the features that differed the most between impactful and non-impactful (or positive and negative impactful) datasets were domain-specific. We therefore include the category-specific results in Appendix C, which provide insights for developers in each of these domains, and show that in-depth analysis is necessary to provide meaningful results when considering such a large sample of popular apps. The features identified through this study were very specific to the dataset, and the time period, studied, and we therefore cannot claim that these features correlate with impact in other stores or samples. Rather, we conducted these experiments to provide some insight into features that correlate with success, and to demonstrate a means with which to do so. We encourage both researchers and app developers to try these methods, in order to identify other features that correlate with success.
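
One simple way to surface such features is to compare how often each topic is added in impactful versus non-impactful releases and rank the topics by the difference, as sketched below. The topic sets and release groupings here are invented placeholders; this illustrates the comparison only, not the analysis reported in Appendix C.

    from collections import Counter

    # Topics added per release (invented examples).
    impactful = [{"map", "gps"}, {"gps"}, {"social"}]
    non_impactful = [{"social"}, {"ads"}, {"social", "ads"}]

    def prevalence(release_topics):
        """Proportion of releases in which each topic was added."""
        counts = Counter(t for topics in release_topics for t in topics)
        return {t: counts[t] / len(release_topics) for t in counts}

    p_imp, p_non = prevalence(impactful), prevalence(non_impactful)
    diff = {t: p_imp.get(t, 0.0) - p_non.get(t, 0.0) for t in set(p_imp) | set(p_non)}

    # Topics added far more often in impactful releases appear at the top.
    for topic, delta in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{topic}: {delta:+.2f}")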

Our implementation of causal impact analysis, CIRA, is available for use online via a web interface, and also has a built-in multi-platform desktop GUI. We implemented two other extensions, in order to maximise the utility of such an approach: an ensemble classifier, which runs multiple classifiers using alternative techniques for correcting the baseline prediction; and feature regression, which enables features to be trained via linear regression, in order to aid prediction. Unfortunately we did not apply these extensions in studies relating to app store analysis, but it is our hope that future studies will make use of them.

This thesis provides several contributions, through the method of extracting features via topic modelling, the identification of the app sampling problem, the implementation of CIRA, and the identification of some of the properties of impactful app releases. We have, in doing so, explored several avenues of what makes successful apps successful, yet there remain many unexplored avenues for answering this question. We hope to have provided some potential options or inspiration for future research, and to have identified gaps in the research through the literature survey. In synthesising these two contributions, we find that prediction is likely to be a powerful tool for app developers and researchers alike, and it may be that causal inference can be adapted to this task.

No matter how sophisticated the predictors or methodology for analysing data, it is important that any approach continues to be applied to recent and relevant data. Doing so helps to mitigate the app sampling problem and other similar bias effects, and provides actionable findings to both researchers and developers, as was the aim of this thesis.

Bibliography

[1] N. A. S. Abdullah, N. I. A. Rusli, and M. F. Ibrahim, “Mobile game size estimation: COSMIC FSM rules, UML mapping model and Unity3D game engine,” in IEEE Conference on Open Systems (ICOS), 2014, pp. 42–47.

[2] B. Adams, S. Bellomo, C. Bird, T. Marshall-Keim, F. Khomh, and K. Moir, “The practice and future of release engineering: A roundtable with three release engineers,” IEEE Software, vol. 32, no. 2, pp. 42–49, 2015.

[3] N. Agarwal, R. Karimpour, and G. Ruhe, “Theme-based product release planning: An analytical approach,” in System Sciences (HICSS), 2014 47th Hawaii International Conference on. IEEE, 2014, pp. 4739–4748.

[4] A. Al-Subaihin, A. Finkelstein, M. Harman, Y. Jia, W. Martin, F. Sarro, and Y. Zhang, “App store mining and analysis,” in Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile, DeMobile 2015. ACM, 2015, pp. 1–2.

[5] C. Albanesius, “Mobile app reviews: Google ’Bouncer’ now scanning Android Market for malware,” http://uk.pcmag.com/apps/66697/news/ google-bouncer-now-scanning-android-market-for-mal, 2012.

[6] K. Alharbi and T. Yeh, “Collect, decompile, extract, stats, and diff: Mining design pattern changes in Android apps,” in Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services, MobileHCI ’15. ACM, 2015, pp. 515–524.

[7] Amazon.com, “Amazon Mechanical Turk,” https://www.mturk.com/mturk/welcome, 2013.

[8] A. K. McCallum, “MALLET: A Machine Learning for Language Toolkit,” http://mallet.cs.umass.edu, 2002.

[9] androguard, “androguard,” https://github.com/androguard/androguard, 2015.

[10] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, K. Rieck, and C. Siemens, “Drebin: Effective and explainable detection of Android malware in your pocket,” in Proceedings of the Annual Symposium on Network and Distributed System Security (NDSS), 2014.

[11] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel, “FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps,” in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’14). ACM, 2014, pp. 259–269.

[12] G. Askalidis, “The impact of large scale promotions on the sales and ratings of mobile apps: Evidence from Apple’s App Store,” CoRR, vol. abs/1506.06857, 2015.

[13] K. W. Y. Au, Y. F. Zhou, Z. Huang, and D. Lie, “Pscout: analyzing the Android permission specification,” in Proceedings of the 2012 ACM conference on Computer and communications security. ACM, 2012, pp. 217–228.

[14] V. Avdiienko, K. Kuznetsov, A. Gorla, A. Zeller, S. Arzt, S. Rasthofer, and E. Bodden, “Mining apps for abnormal usage of sensitive data,” in Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, 2015, pp. 426–436.

[15] N. S. Awang Abu Bakar and I. Mahmud, “Empirical analysis of android apps permissions,” in Advanced Computer Science Applications and Technologies (ACSAT), 2013 International Conference on. IEEE, 2013, pp. 406–411.

[16] S. A. Azad, “Empirical studies of Android API usage: Suggesting related API calls and detecting license violations,” Master’s thesis, Concordia University, 2015.

[17] A. A. Bakar, N. S. Mahmud, and I. Mahmud, “OSSGrab: Software repositories and app store mining tool,” Lecture Notes on Software Engineering, vol. 1, no. 3, pp. 219–223, 2013.

[18] D. Barrera, H. G. Kayacik, P. C. van Oorschot, and A. Somayaji, “A methodology for empirical analysis of permission-based security models and its application to Android,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS ’10. ACM, 2010, pp. 73–84.

[19] A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon, “Static analysis for extracting permission checks of a large scale framework: The challenges and solutions for analyzing Android,” Software Engineering, IEEE Transactions on, vol. 40, no. 6, pp. 617–632, 2014.

[20] O. Bastani, S. Anand, and A. Aiken, “Interactively verifying absence of explicit information flows in Android apps,” in Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015. ACM, 2015, pp. 299–315.

[21] L. Batyuk, M. Herpich, S. A. Camtepe, K. Raddatz, A.-D. Schmidt, and S. Albayrak, “Using static analysis for automatic assessment and mitigation of unwanted and malicious activities within Android applications,” in Proceedings of the 2011 6th International Conference on Malicious and Unwanted Software, MALWARE ’11. IEEE Computer Society, 2011, pp. 66–72.

[22] G. Bavota, M. Linares-Vasquez, C. E. Bernal-Cardenas, M. D. Penta, R. Oliveto, and D. Poshyvanyk, “The impact of API change- and fault-proneness on the user ratings of Android apps,” IEEE Transactions on Software Engineering, vol. 41, no. 4, pp. 384–407, 2015.

[23] BBC, “Flappy Bird creator removes game from app stores,” http://www.bbc.co.uk/news/technology-26114364, 2014.

[24] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society (Series B), vol. 57, no. 1, pp. 289–300, 1995.

[25] D. L. Ben Lulu and T. Kuflik, “Functionality-based clustering using short textual description: Helping users to find apps installed on their mobile device,” in Proceedings of the 2013 International Conference on Intelligent User Interfaces, IUI ’13. ACM, 2013, pp. 297–306.

[26] D. L. Ben Lulu and T. Kuflik, “Wise mobile icons organization: Apps taxonomy classification using functionality mining to ease apps finding,” Mobile Information Systems, 2015, article ID: 3083450.

[27] G. Berardi, A. Esuli, T. Fagni, and F. Sebastiani, “Multi-store metadata-based supervised mobile app classification,” in Proceedings of the 30th Annual ACM Symposium on Applied Computing, SAC ’15. ACM, 2015, pp. 585–588.

[28] M. Bertrand, E. Duflo, and S. Mullainathan, “How much should we trust differences-in-differences estimates?” National Bureau of Economic Research, Tech. Rep., 2002.

[29] P. Bhattacharya, L. Ulanova, I. Neamtiu, and S. C. Koduru, “An empirical analysis of bug reports and bug fixing in open source Android apps,” in Proceedings of the 2013 17th European Conference on Software Maintenance and Reengineering, CSMR ’13. IEEE Computer Society, 2013, pp. 133–143.

[30] R. Bhoraskar, S. Han, J. Jeon, T. Azim, S. Chen, J. Jung, S. Nath, R. Wang, and D. Wetherall, “Brahmastra: Driving apps to test the security of third-party components,” in Proceedings of the 23rd USENIX Conference on Security Symposium, SEC’14. USENIX Association, 2014, pp. 1021–1036.

[31] T. Bläsing, L. Batyuk, A. Schmidt, S. A. Çamtepe, and S. Albayrak, “An Android application sandbox system for suspicious software detection,” in Proceedings of the 5th International Conference on Malicious and Unwanted Software, MALWARE’10, 2010, pp. 55–62.

[32] D. M. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” JMLR, vol. 3, pp. 993–1022, 2003.

[33] M. Böhmer, B. Hecht, J. Schöning, A. Krüger, and G. Bauer, “Falling asleep with angry birds, facebook and kindle: A large scale study on mobile application usage,” in Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services, MobileHCI ’11. ACM, 2011, pp. 47–56.

[34] T. Book, A. Pridgen, and D. S. Wallach, “Longitudinal analysis of android ad library permissions,” CoRR, vol. abs/1303.0857, 2013.

[35] H. S. Borges and M. T. Valente, “Mining usage patterns for the Android API,” PeerJ Computer Science, no. 1:e12, 2015.

[36] K. H. Brodersen, “CausalImpact,” https://google.github.io/CausalImpact/CausalImpact.html, retrieved 28th May 2015.

[37] K. H. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S. L. Scott, “Inferring causal impact using Bayesian structural time-series models,” Annals of Applied Statistics, vol. 9, pp. 247–274, 2015.

[38] M. Butler, “Android: Changing the mobile landscape,” IEEE Pervasive Computing, vol. 10, no. 1, pp. 4–7, 2011.

[39] J. Callaham, “Google says there are now 1.4 billion active Android devices worldwide,” http://www.androidcentral.com/google-says-there-are-now-14-billion-active-android-devices-worldwide, 2015.

[40] J. C. Campbell, A. Hindle, and J. N. Amaral, “Syntax errors just aren’t natural: Improving error reporting with language models,” in Proceedings of the 11th Working Conference on Mining Software Repositories, MSR ’14. ACM, 2014, pp. 252–261.

[41] B. Carbunar and R. Potharaju, “A longitudinal study of the Google app market,” in Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, ASONAM ’15. ACM, 2015, pp. 242–249.

[42] M. Ceccarelli, L. Cerulo, G. Canfora, and M. Di Penta, “An eclectic approach for change impact analysis,” in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2. ACM, 2010, pp. 163–166.

[43] L. Cen, C. S. Gates, L. Si, and N. Li, “A probabilistic discriminative model for Android malware detection with decompiled source code,” IEEE Trans. Dependable Sec. Comput., vol. 12, no. 4, pp. 400–412, 2015.

[44] L. Cen, D. Kong, H. Jin, and L. Si, “Mobile app security risk assessment: A crowdsourcing ranking approach from user comments,” in Proceedings of the 2015 SIAM International Conference on Data Mining, 2015, pp. 658–666.

[45] L. Cen, L. Si, N. Li, and H. Jin, “User comment analysis for Android apps and CSPI detection with comment expansion,” in Proceedings of the 1st International Workshop on Privacy-Preserving IR (PIR), 2014, pp. 25–30.

[46] S. Chakradeo, B. Reaves, P. Traynor, and W. Enck, “Mast: Triage for market-scale mobile malware analysis,” in Proceedings of the Sixth ACM Conference on Security and Privacy in Wireless and Mobile Networks, WiSec ’13. ACM, 2013, pp. 13–24.

[47] R. Chandy and H. Gu, “Identifying spam in the iOS App Store,” in Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality, WebQuality ’12. ACM, 2012, pp. 56–59.

[48] K. Chen, P. Wang, Y. Lee, X. Wang, N. Zhang, H. Huang, W. Zou, and P. Liu, “Finding unknown malice in 10 seconds: Mass vetting for new threats at the Google-Play scale,” in Proceedings of the 24th USENIX Conference on Security Symposium, SEC’15. USENIX Association, 2015, pp. 659–674.

[49] M. Chen and X. Liu, “Predicting popularity of online distributed applications: iTunes App Store case analysis,” in Proceedings of the 2011 iConference, iConference ’11. ACM, 2011, pp. 661–663.

[50] N. Chen, J. Lin, S. C. H. Hoi, X. Xiao, and B. Zhang, “Ar-miner: Mining informative reviews for developers from mobile app marketplace,” in Proceedings of the 36th International Conference on Software Engineering, ICSE ’14. ACM, 2014, pp. 767–778.

[51] Y. Chen, H. Xu, Y. Zhou, and S. Zhu, “Is this app safe for children?: A comparison study of maturity ratings on Android and iOS applications,” in Proceedings of the 22nd International Conference on World Wide Web, WWW ’13. International World Wide Web Conferences Steering Committee, 2013, pp. 201–212.

[52] P. H. Chia, Y. Yamamoto, and N. Asokan, “Is this app safe?: A large scale study on application permissions and risk signals,” in Proceedings of the 21st International Conference on World Wide Web, WWW ’12. ACM, 2012, pp. 311–320.

[53] L. Cocco, K. Mannaro, G. Concas, and M. Marchesi, “Simulation of the best ranking algorithms for an app store,” in Mobile Web Information Systems, Lecture Notes in Computer Science. Springer International Publishing, 2014, vol. 8640, pp. 233–247.

[54] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.

[55] S. Comino, F. M. Manenti, and F. Mariuzzo, “Updates management in mobile applications. iTunes vs Google Play,” Centre for Competition Policy (CCP), University of East Anglia, 2015.

[56] F. T. Commission, “Complying with COPPA: Frequently Asked Questions,” https://www.ftc.gov/tips-advice/business-center/guidance/complying-coppa-frequently-asked-questions, 2015.

[57] L. Corral and I. Fronza, “Better code for better apps: A study on source code quality and market success of Android applications,” in Proceedings of the Second ACM International Conference on Mobile Software Engineering and Systems, MOBILESoft ’15. IEEE Press, 2015, pp. 22–32.

[58] COSMIC, “Common software measurement international consortium,” http://cosmic-sizing.org/, 2015.

[59] P. Coulton and W. Bamford, “Experimenting through mobile ’apps’ and ’app stores’,” Int. J. Mob. Hum. Comput. Interact., vol. 3, no. 4, pp. 55–70, 2011.

[60] C. Couto, P. Pires, M. T. Valente, R. S. Bigonha, and N. Anquetil, “Predicting software defects with causality tests,” Journal of Systems and Software, vol. 93, pp. 24–41, 2014.

[61] C. Couto, C. Silva, M. T. Valente, R. Bigonha, and N. Anquetil, “Uncovering causal relationships between software metrics and bugs,” in Proceedings of the 16th European Conference on Software Maintenance and Reengineering (CSMR’12), 2012, pp. 223–232.

[62] Creative Research Systems, “Sample size calculator,” http://www.surveysystem.com/sscalc.htm, 2012.

[63] J. Crussell, C. Gibler, and H. Chen, “Attack of the clones: Detecting cloned applications on Android markets,” in Computer Security–ESORICS 2012. Springer, 2012, pp. 37–54.

[64] J. Crussell, C. Gibler, and H. Chen, “Andarwin: Scalable detection of semantically similar Android applications,” in Computer Security–ESORICS 2013. Springer, 2013, pp. 182–199.

[65] J. Crussell, R. Stevens, and H. Chen, “Madfraud: Investigating ad fraud in Android applications,” in Proceedings of the 12th annual international conference on Mobile systems, applications, and services. ACM, 2014, pp. 123–134.

[66] DAI-Labor, “Androlyzer,” https://androlyzer.com/, 2015.

[67] D. Datta and S. Kajanan, “Do app launch times impact their subsequent commercial success? an analytical approach,” in Cloud Computing and Big Data (CloudCom-Asia), 2013 International Conference on. IEEE, 2013, pp. 205–210.

[68] L. D’Avanzo, F. Ferrucci, C. Gravino, and P. Salza, “Cosmic functional measurement of mobile applications and code size estimation,” in Proceedings of the 30th Annual ACM Symposium on Applied Computing, SAC ’15. ACM, 2015, pp. 1631–1636.

[69] Dave Chaffey, “Mobile marketing statistics compilation,” http://www.smartinsights.com/mobile-marketing/mobile-marketing-analytics/mobile-marketing-statistics/, 2016.

[70] Z. Deng, B. Saltaformaggio, X. Zhang, and D. Xu, “iris: Vetting private API abuse in iOS applications,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015, pp. 44–56.

[71] M. L. Dering and P. McDaniel, “Android Market reconstruction and analysis,” in Proceedings of the 2014 IEEE Military Communications Conference, MILCOM ’14. IEEE Computer Society, 2014, pp. 300–305.

[72] N. d’Heureuse, F. Huici, M. Arumaithurai, M. Ahmed, K. Papagiannaki, and S. Niccolini, “What’s app?: a wide-scale measurement study of smart phone markets,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 16, no. 2, pp. 16–27, 2012.

[73] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth, “Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones,” in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10. USENIX Association, 2010, pp. 1–6.

[74] W. Enck, M. Ongtang, and P. McDaniel, “On lightweight mobile phone application certification,” in Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS ’09. ACM, 2009, pp. 235–245.

[75] D. Erić, R. Bačík, and I. Fedorko, “Rating decision analysis based on iOS App Store data,” Quality Innovation Prosperity, vol. 18, no. 2, pp. 27–37, 2014.

[76] D. Ferreira, V. Kostakos, and A. K. Dey, “Lessons learned from large-scale user studies: Using Android Market as a source of data,” Int. J. Mob. Hum. Comput. Interact., vol. 4, no. 3, pp. 28–43, 2012.

[77] F. Ferrucci, C. Gravino, P. Salza, and F. Sarro, “Investigating functional and code size measures for mobile applications,” in 41st Euromicro Conference series on Software Engineering and Advanced Applications, SEAA ’15. IEEE, 2015, pp. 365–368.

[78] F. Ferrucci, C. Gravino, P. Salza, and F. Sarro, “Investigating functional and code size measures for mobile applications: A replicated study,” in 16th International Conference on Product-Focused Software Process Improvement, PROFES ’15, 2015.

[79] A. Finkelstein, M. Harman, Y. Jia, W. Martin, F. Sarro, and Y. Zhang, “App store analysis: Mining app stores for relationships between customer, business and technical characteristics,” Tech. Rep., 2014, rN/14/10.

[80] R. Francese, C. Gravino, M. Risi, G. Scanniello, and G. Tortora, “On the use of requirements measures to predict software project and product measures in the context of Android mobile apps: a preliminary study,” in 41st Euromicro Conference series on Software Engineering and Advanced Applications (SEAA ’15). IEEE, 2015, pp. 357–364.

[81] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Mach. Learn., vol. 29, no. 2-3, pp. 131–163, 1997.

[82] B. Fu, J. Lin, L. Li, C. Faloutsos, J. Hong, and N. Sadeh, “Why people hate your app: Making sense of user feedback in a mobile app store,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13. ACM, 2013, pp. 1276–1284.

[83] L. V. Galvis Carreño and K. Winbladh, “Analysis of user comments: An approach for software requirements evolution,” in Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13. IEEE Press, 2013, pp. 582–591.

[84] C. Gao, H. Xu, J. Hu, and Y. Zhou, “Ar-tracker: Track the dynamics of mobile apps via user review mining,” in 2015 IEEE Symposium on Service-Oriented System Engineering, SOSE ’15, 2015, pp. 284–290.

[85] R. Garg and R. Telang, “Inferring app demand from publicly available data,” MIS Q., vol. 37, no. 4, pp. 1253–1264, 2013.

[86] C. Gibler, J. Crussell, J. Erickson, and H. Chen, “AndroidLeaks: Automatically detecting potential privacy leaks in Android applications on a large scale,” in Proceedings of the 5th International Conference on Trust and Trustworthy Computing, TRUST’12. Springer-Verlag, 2012, pp. 291–307.

[87] C. Gibler, R. Stevens, J. Crussell, H. Chen, H. Zang, and H. Choi, “Adrob: Examining the landscape and impact of Android application plagiarism,” in Proceedings of the 11th annual international conference on Mobile systems, applications, and services. ACM, 2013, pp. 431–444.

[88] GinLemon, “Smart launcher 3 — simple. light. fast.” http://www.smartlauncher.net/, 2011.

[89] GitHub, Inc., “Github,” https://github.com/, 2014.

[90] W. Glodek and R. Harang, “Rapid permissions-based detection and analysis of mobile malware using random decision forests,” in Military Communications Conference, MILCOM 2013-2013 IEEE. IEEE, 2013, pp. 980–985.

[91] M. Gómez, M. Martinez, M. Monperrus, and R. Rouvoy, “When app stores listen to the crowd to fight bugs in the wild,” in Proceedings of the 37th International Conference on Software Engineering - Volume 2, ICSE ’15. IEEE Press, 2015, pp. 567–570.

[92] M. Gómez, R. Rouvoy, M. Monperrus, and L. Seinturier, “A recommender system of buggy app checkers for app store moderators,” in 2nd ACM International Conference on Mobile Software Engineering and Systems, D. Dig and Y. Dubinsky, Eds. IEEE, 2015.

[93] Google, “GitHub – google/guava: Google Core Libraries for Java 6+,” https://github. com/google/guava, 2016.

[94] Google and MobiSystems, “OfficeSuite Pro + PDF Android Apps on Google Play,” https://play.google.com/store/apps/details?id=com.mobisystems.editor.office_registered, retrieved 8th March 2016.

[95] A. Gorla, I. Tavecchia, F. Gross, and A. Zeller, “Checking app behavior against app descriptions,” in Proceedings of the 2014 International Conference on Software Engineering, ICSE ’14. ACM Press, 2014, pp. 292–302.

[96] M. Goul, O. Marjanovic, S. Baxley, and K. Vizecky, “Managing the enterprise business intelligence app store: Sentiment analysis supported requirements engineering,” in Proceedings of the 2012 45th Hawaii International Conference on System Sciences, HICSS ’12, 2012, pp. 4168–4177.

[97] M. Grace, W. Zhou, X. Jiang, and A.-R. Sadeghi, “Unsafe exposure analysis of mobile in-app advertisements,” in Proceedings of the Fifth ACM Conference on Security and Privacy in Wireless and Mobile Networks, WISEC ’12. ACM, 2012, pp. 101–112.

[98] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, “RiskRanker: Scalable and accurate zero-day Android malware detection,” in Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services, MobiSys ’12. ACM, 2012, pp. 281–294.

[99] X. Gu and S. Kim, “What parts of your apps are loved by users?” in Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE’15), 2015.

[100] L. Guerrouj, S. Azad, and P. C. Rigby, “The influence of App churn on App success and StackOverflow discussions,” in Proceedings of the 22nd International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2015, pp. 321–330.

[101] J. Gui, S. Mcilroy, M. Nagappan, and W. G. J. Halfond, “Truth in advertising: The hidden cost of mobile ads for software developers,” in Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15. IEEE Press, 2015, pp. 100–110.

[102] E. Guzman, O. Aly, and B. Bruegge, “Retrieving diverse opinions from app reviews,” in ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’15. IEEE, 2015, pp. 1–10.

[103] E. Guzman, M. El-Haliby, and B. Bruegge, “Ensemble methods for app review classification: An approach for software evolution (N),” in 30th IEEE/ACM International Conference on Automated Software Engineering, ASE ’15, 2015, pp. 771–776.

[104] E. Guzman and W. Maalej, “How do users like this feature? a fine grained sentiment analysis of app reviews,” in 22nd IEEE International Requirements Engineering Conference (RE’14), 2014, pp. 153–162.

[105] E. Ha and D. Wagner, “Do Android users write about electric sheep? examining consumer reviews in Google Play,” in 10th IEEE Consumer Communications and Networking Conference, CCNC ’13. IEEE, 2013, pp. 149–157.

[106] Y. J. Ham and H.-W. Lee, “Detection of malicious Android mobile applications based on aggregated system call events,” International Journal of Computer and Communication Engineering, vol. 3, no. 2, pp. 149–154, 2014.

[107] S. Hao, B. Liu, S. Nath, W. G. Halfond, and R. Govindan, “PUMA: Programmable UI-Automation for Large-Scale Dynamic Analysis of Mobile Apps,” in Proceedings of the 12th International Conference on Mobile Systems, Applications, and Services (MobiSys’14), 2014.

[108] M. Harman, A. Al-Subaihin, Y. Jia, W. Martin, F. Sarro, and Y. Zhang, “Mobile app and app store analysis, testing and optimisation,” in Proceedings of the International Conference on Mobile Software Engineering and Systems, MOBILESoft ’16. ACM, 2016, pp. 243–244.

[109] M. Harman, Y. Jia, and Y. Zhang, “App Store Mining and Analysis: MSR for App Stores,” in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR ’12, 2012, pp. 108–111.

[110] X. He, W. Dai, G. Cao, R. Tang, M. Yuan, and Q. Yang, “Mining target users for online marketing based on app store data,” in 2015 IEEE International Conference on Big Data. IEEE, 2015, pp. 1043–1052.

[111] S. Hedegaard and J. G. Simonsen, “Extracting usability and user experience information from online user reviews,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’13. ACM, 2013, pp. 2089–2098.

[112] H. Heitkötter, S. Hanschke, and T. A. Majchrzak, “Evaluating cross-platform development approaches for mobile applications,” in Web Information Systems and Technologies - 8th International Conference, WEBIST 2012, Porto, April 18-21, 2012, Revised Selected Papers, 2012, pp. 120–138.

[113] N. Henze and S. Boll, “Release your app on Sunday eve: Finding the best time to deploy apps,” in Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services (MobileHCI’11), 2011, pp. 581–586.

[114] A. Hindle, “Green software engineering: the curse of methodology,” in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 5. IEEE, 2016, pp. 46–55.

[115] T.-H. Ho, D. Dean, X. Gu, and W. Enck, “PREC: Practical root exploit containment for Android devices,” in Proceedings of the 4th ACM Conference on Data and Application Security and Privacy, CODASPY ’14. ACM, 2014, pp. 187–198.

[116] P. W. Holland, “Statistics and causal inference,” Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.

[117] L. Hoon, M. Rodríguez-García, R. Vasa, R. Valencia-García, and J.-G. Schneider, “App reviews: Breaking the user and developer language barrier,” in Trends and Applications in Software Engineering, Advances in Intelligent Systems and Computing. Springer International Publishing, 2016, vol. 405, pp. 223–233.

[118] L. Hoon, R. Vasa, G. Y. Martino, J.-G. Schneider, and K. Mouzakis, “Awesome!: Conveying satisfaction on the app store,” in Proceedings of the 25th Australian Computer-Human Interaction Conference: Augmentation, Application, Innovation, Collaboration, OzCHI ’13. ACM, 2013, pp. 229–232.

[119] L. Hoon, R. Vasa, J.-G. Schneider, and J. Grundy, “An analysis of the mobile app review landscape: trends and implications,” Faculty of Information and Communication Technologies, Swinburne University of Technology, Tech. Rep., 2013.

[120] L. Hoon, R. Vasa, J.-G. Schneider, and K. Mouzakis, “A preliminary analysis of vocabulary in mobile app user reviews,” in Proceedings of the 24th Australian Computer-Human Interaction Conference, OzCHI ’12. ACM, 2012, pp. 245–248.

[121] M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04. ACM, 2004, pp. 168–177.

[122] J. Huang, Z. Li, X. Xiao, Z. Wu, K. Lu, X. Zhang, and G. Jiang, “SUPOR: Precise and scalable sensitive user input detection for Android apps,” in Proceedings of the 24th USENIX Conference on Security Symposium, SEC’15. USENIX Association, 2015, pp. 977–992.

[123] W. Huang, Y. Dong, A. Milanova, and J. Dolby, “Scalable and precise taint analysis for Android,” in Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015. ACM, 2015, pp. 106–117.

[124] C. Iacob and R. Harrison, “Retrieving and analyzing mobile apps feature requests from online reviews,” in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13. IEEE Press, 2013, pp. 41–44.

[125] C. Iacob, R. Harrison, and S. Faily, “Online reviews as first class artifacts in mobile app development,” in Proceedings of the 5th International Conference on Mobile Computing, Applications, and Services. MobiCASE ’13, 2014.

[126] C. Iacob, V. Veerappa, and R. Harrison, “What are you complaining about?: A study of online reviews of mobile applications,” in Proceedings of the 27th International BCS Human Computer Interaction Conference, BCS-HCI ’13. British Computer Society, 2013, pp. 29:1–29:6.

[127] S.-Y. Ihm, W.-K. Loh, and Y.-H. Park, “App analytic: A study on correlation analysis of app ranking data,” International Conference on Cloud and Green Computing (CGC), vol. 0, pp. 561–563, 2013.

[128] J. Jeon, K. K. Micinski, J. A. Vaughan, A. Fogel, N. Reddy, J. S. Foster, and T. Millstein, “Dr. Android and mr. hide: Fine-grained permissions in Android applications,” in Proceedings of the Second ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, SPSM ’12. ACM, 2012, pp. 3–14.

[129] H. Jiang, H. Ma, Z. Ren, J. Zhang, and X. Li, “What makes a good app description?” in Proceedings of the 6th Asia-Pacific Symposium on Internetware on Internetware, INTERNETWARE 2014. ACM, 2014, pp. 45–53.

[130] N. Jindal and B. Liu, “Opinion spam and analysis,” in Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08. ACM, 2008, pp. 219–230.

[131] N. Jindal, B. Liu, and E.-P. Lim, “Finding unusual review patterns using unexpected rules,” in Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10. ACM, 2010, pp. 1549–1552.

[132] Y. Jing, G.-J. Ahn, Z. Zhao, and H. Hu, “Riskmon: Continuous and automated risk assessment of mobile applications,” in Proceedings of the 4th ACM Conference on Data and Application Security and Privacy, CODASPY ’14. ACM, 2014, pp. 99–110.

[133] Y. Jo and A. H. Oh, “Aspect and sentiment unification model for online review analysis,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11. ACM, 2011, pp. 815–824.

[134] M. E. Joorabchi, M. Ali, and A. Mesbah, “Detecting inconsistencies in multi-platform mobile apps,” in Proceedings of the 26th IEEE International Symposium on Software Reliability Engineering, ISSRE ’15, 2015.

[135] E.-Y. Jung, C. Baek, and J.-D. Lee, “Product survival analysis for the App Store,” Marketing Letters, vol. 23, no. 4, pp. 929–941, 2012.

[136] JUnit, “JUnit – About,” http://junit.org/junit4/, 2016.

[137] H. Khalid, M. Nagappan, and A. Hassan, “Examining the relationship between FindBugs warnings and end user ratings: A case study on 10,000 Android apps,” IEEE Transactions on Software Engineering, 2015.

[138] H. Khalid, “On identifying user complaints of iOS apps,” in Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13. IEEE Press, 2013, pp. 1474–1476.

[139] H. Khalid, M. Nagappan, E. Shihab, and A. E. Hassan, “Prioritizing the devices to test your app on: A case study of Android game apps,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014. ACM, 2014, pp. 610–620.

[140] H. Khalid, E. Shihab, M. Nagappan, and A. E. Hassan, “What do mobile app users complain about?” IEEE Software, vol. 32, no. 3, pp. 70–77, 2015.

[141] M. Khalid, M. Asif, and U. Shehzaib, “Towards improving the quality of mobile app reviews,” International Journal of Information Technology and Computer Science (IJITCS), vol. 7, no. 10, p. 35, 2015.

[142] M. Khalid, U. Shehzaib, and M. Asif, “A case of mobile app reviews as a ,” International Journal of Information Engineering and Electronic Business (IJIEEB), vol. 7, no. 5, p. 39, 2015.

[143] K. Khanmohammadi, M. R. Rejali, and A. Hamou-Lhadj, “Understanding the service life cycle of Android apps: An exploratory study,” in Proceedings of the 5th Annual ACM CCS Workshop on Security and Privacy in Smartphones and Mobile Devices, SPSM ’15. ACM, 2015, pp. 81–86.

[144] F. Khomh, T. Dhaliwal, Y. Zou, and B. Adams, “Do faster releases improve software quality? an empirical case study of Mozilla Firefox,” in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories (MSR’12), 2012, pp. 179–188.

[145] D. Kim, A. Gokhale, V. Ganapathy, and A. Srivastava, “Detecting plagiarized mobile apps using API birthmarks,” Automated Software Engineering, pp. 1–28, 2015.

[146] J. Kim, Y. Park, C. Kim, and H. Lee, “Mobile application service networks: Apple’s App Store,” Service Business, vol. 8, no. 1, pp. 1–27, 2014.

[147] B. A. Kitchenham, T. Dyba, and M. Jorgensen, “Evidence-based software engineering,” in Proceedings of the 26th International Conference on Software Engineering, ICSE ’04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 273–281.

[148] N. Klein, C. S. Corley, and N. A. Kraft, “New features for duplicate bug detection,” in Proceedings of the 11th Working Conference on Mining Software Repositories, MSR ’14. ACM, 2014, pp. 324–327.

[149] D. E. Krutz, M. Mirakhorli, S. A. Malachowsky, A. Ruiz, J. Peterson, A. Filipski, and J. Smith, “A dataset of open-source Android applications,” in Proceedings of the 12th Working Conference on Mining Software Repositories, MSR ’15. IEEE Press, 2015, pp. 522–525.

[150] N. Lageman, M. Lindsey, and W. Glodek, “Detecting malicious Android applications from runtime behavior,” in Military Communications Conference, MILCOM 2015-2015 IEEE. IEEE, 2015, pp. 324–329.

[151] G. Lee and T. S. Raghu, “Product portfolio and mobile apps success: Evidence from App Store market,” in Proceedings of the 17th Americas Conference on Information Systems AMCIS ’11, V. Sambamurthy and M. Tanniru, Eds. Association for Information Systems, 2011.

[152] G. Lee and T. Raghu, “Determinants of mobile apps’ success: Evidence from the app store market,” Journal of Management Information Systems, vol. 31, no. 2, pp. 133–170, 2014.

[153] L. Li, A. Bartel, T. F. D. A. Bissyande, J. Klein, Y. Le Traon, S. Arzt, S. Rasthofer, E. Bodden, D. Octeau, and P. McDaniel, “Iccta: detecting inter-component privacy leaks in Android apps,” in 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, ICSE ’15, 2015.

[154] T.-P. Liang, X. Li, C.-T. Yang, and M. Wang, “What in consumer reviews affects the sales of mobile apps: A multifacet sentiment analysis approach,” International Journal of Electronic Commerce, vol. 20, no. 2, pp. 236–260, 2015.

[155] S. Lim, P. Bentley, N. Kanakam, F. Ishikawa, and S. Honiden, “Investigating country differences in mobile app user behavior and challenges for software engineering,” IEEE Transactions on Software Engineering (TSE), 2015.

[156] S. L. Lim and P. J. Bentley, “App epidemics: Modelling the effects of publicity in a mobile app ecosystem,” in Artificial Life 13: Proceedings of the Thirteenth International Conference on the Simulation and Synthesis of Living Systems (ALIFE), 2012.

[157] S. L. Lim and P. J. Bentley, “How to be a successful app developer: lessons from the simulation of an app ecosystem,” ACM SIGEVOlution, vol. 6, no. 1, pp. 2–15, 2012.

[158] S. L. Lim and P. J. Bentley, “Investigating app store ranking algorithms using a simulation of mobile app ecosystems,” in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2013, 2013, pp. 2672–2679.

[159] S. L. Lim, P. J. Bentley, and F. Ishikawa, “The effects of developer dynamics on fitness in an evolutionary ecosystem model of the App Store,” IEEE Transactions on Evolutionary Computation (TEVC), vol. PP, 2015.

[160] J. Lin, S. Amini, J. I. Hong, N. Sadeh, J. Lindqvist, and J. Zhang, “Expectation and purpose: Understanding users’ mental models of mobile app privacy through crowdsourcing,” in Proceedings of the 2012 ACM Conference on Ubiquitous Computing, UbiComp ’12. ACM, 2012, pp. 501–510.

[161] J. Lin, K. Sugiyama, M.-Y. Kan, and T.-S. Chua, “Addressing cold-start in app recommendation: Latent user models constructed from twitter followers,” in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’13. ACM, 2013, pp. 283–292.

[162] J. Lin, K. Sugiyama, M.-Y. Kan, and T.-S. Chua, “New and improved: Modeling versions to improve app recommendation,” in Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14. ACM, 2014, pp. 647–656.

[163] M. Linares-Vásquez, “Supporting evolution and maintenance of Android apps,” in Companion Proceedings of the 36th International Conference on Software Engineering, ICSE Companion 2014. ACM, 2014, pp. 714–717.

[164] M. Linares-Vásquez, G. Bavota, C. Bernal-Cárdenas, M. Di Penta, R. Oliveto, and D. Poshyvanyk, “API change and fault proneness: A threat to the success of Android apps,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013. ACM, 2013, pp. 477–487.

[165] M. Linares-Vásquez, G. Bavota, C. Bernal-Cárdenas, R. Oliveto, M. Di Penta, and D. Poshyvanyk, “Mining energy-greedy API usage patterns in Android apps: An empirical study,” in Proceedings of the 11th Working Conference on Mining Software Repositories, MSR ’14. ACM, 2014, pp. 2–11.

[166] M. Linares-Vásquez, G. Bavota, C. E. B. Cárdenas, R. Oliveto, M. Di Penta, and D. Poshyvanyk, “Optimizing energy consumption of GUIs in Android apps: A multi-objective approach,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, 2015, pp. 143–154.

[167] M. Linares-Vásquez, A. Holtzhauer, C. Bernal-Cárdenas, and D. Poshyvanyk, “Revisiting Android reuse studies in the context of code obfuscation and library usages,” in Proceedings of the 11th Working Conference on Mining Software Repositories, MSR ’14. ACM, 2014, pp. 242–251.

[168] B. Liu, D. Kong, L. Cen, N. Z. Gong, H. Jin, and H. Xiong, “Personalized mobile app recommendation: Reconciling app functionality and user privacy preference,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM ’15. ACM, 2015, pp. 315–324.

[169] B. Liu, B. Liu, H. Jin, and R. Govindan, “Efficient privilege de-escalation for ad libraries in mobile apps,” in Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys ’15. ACM, 2015, pp. 89–103.

[170] B. Liu, S. Nath, R. Govindan, and J. Liu, “Decaf: Detecting and characterizing ad fraud in mobile apps,” in Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, NSDI’14. USENIX Association, 2014, pp. 57–70.

[171] X. Liu and J. Liu, “A two-layered permission-based Android malware detection scheme,” in Proceedings of the 2014 2Nd IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MOBILECLOUD ’14. IEEE Computer Society, 2014, pp. 142–148.

[172] LLVM Developer Group, “The LLVM Compiler Infrastructure Project,” http://llvm.org/, 2003.

[173] S. Ma, S. Wang, D. Lo, R. H. Deng, and C. Sun, “Active semi-supervised approach for checking app behavior against its description,” in 39th IEEE Annual Computer Software and Applications Conference, COMPSAC 2015, Taichung, July 1-5, 2015. Volume 2, 2015, pp. 179–184.

[174] W. Maalej and H. Nabil, “Bug report, feature request, or simply praise? on automatically classifying app reviews,” Requirements Engineering (RE15), 2015.

[175] M. H. Maathuis and P. Nandy, “A review of some recent advances in causal inference,” Handbook of Big Data, p. 387, 2016.

[176] I. Malavolta, S. Ruberto, T. Soru, and V. Terragni, “End users’ perception of hybrid mobile apps in the Google Play store,” in Proceedings of the 4th International Conference on Mobile Services (MS). IEEE, 2015.

[177] I. Malavolta, S. Ruberto, V. Terragni, and T. Soru, “Hybrid mobile apps in the Google Play store: an exploratory investigation,” in Proceedings of the 2nd ACM International Conference on Mobile Software Engineering and Systems. ACM, 2015.

[178] L. M. Manevitz and M. Yousef, “One-class svms for document classification,” J. Mach. Learn. Res., vol. 2, pp. 139–154, 2002.

[179] C. D. Manning, P. Raghavan, and H. Schütze, “Scoring, term weighting, and the vector space model,” in Introduction to Information Retrieval. Cambridge University Press, 2008, pp. 100–123.

[180] K. Mao, L. Capra, M. Harman, and Y. Jia, “A survey of the use of crowdsourcing in software engineering,” Tech. Rep., 2016, rN/15/01.

[181] marketsandmarkets.com, “World Mobile Applications Market - Advanced Technologies, Global Forecast,” http://www.marketsandmarkets.com/Market-Reports/mobile-applications-228.html, 2010.

[182] W. Martin, “Causal impact for app store analysis,” in Companion Proceedings of the 38th International Conference on Software Engineering, ICSE Companion ’16. ACM, 2016.

[183] W. Martin, M. Harman, Y. Jia, F. Sarro, and Y. Zhang, “The app sampling problem for app store mining,” in Proceedings of the 12th IEEE Working Conference on Mining Software Repositories, MSR ’15, 2015, pp. 123–133.

[184] W. Martin, F. Sarro, and M. Harman, “Causal impact analysis applied to app releases in Google Play and Windows Phone Store,” University College London, Tech. Rep., 2015, rN/15/07.

[185] W. Martin, F. Sarro, and M. Harman, “Causal impact analysis for app releases in Google Play,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, FSE ’16, 2016, to appear.

[186] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Transactions on Software Engineering, 2016.

[187] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” University College London, Tech. Rep., 2016, rN/16/02.

[188] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Transactions on Software Engineering (TSE), 2016, under review.

[189] W. Martin and J. Shawe-Taylor, “Automatic correction of topic coherence,” UCL, Tech. Rep., 2012, rN/12/16.

[190] T. McDonnell, B. Ray, and M. Kim, “An empirical study of API stability and adoption in the Android ecosystem,” in Proceedings of the 2013 IEEE International Conference on Software Maintenance, ICSM ’13. IEEE Computer Society, 2013, pp. 70–79.

[191] S. McIlroy, N. Ali, and A. E. Hassan, “Fresh apps: an empirical study of frequently-updated mobile apps in the google play store,” Empirical Software Engineering, pp. 1–25, 2015.

[192] S. McIlroy, N. Ali, H. Khalid, and A. E. Hassan, “Analyzing and automatically labelling the types of user issues that are raised in mobile app reviews,” Empirical Software Engineering, pp. 1–40, 2015.

[193] S. McIlroy, W. Shang, N. Ali, and A. Hassan, “Is it worth responding to reviews? a case study of the top free apps in the Google Play store,” IEEE Software, vol. PP, 2015.

[194] D. Mimno, H. Wallach, E. Talley, M. Leenders, and A. McCallum, “Optimizing semantic coherence in topic models,” in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 262–272.

[195] R. Minelli and M. Lanza, “Samoa — a visual software analytics platform for mobile applications,” in Proceedings of ICSM 2013 (29th International Conference on Software Maintenance). IEEE CS Press, 2013, pp. 476–479.

[196] R. Minelli and M. Lanza, “Software analytics for mobile applications–insights & lessons learned,” 2013 15th European Conference on Software Maintenance and Reengineering, vol. 0, pp. 144–153, 2013.

[197] I. J. Mojica, M. Nagappan, B. Adams, T. Berger, S. Dienst, and A. E. Hassan, “An examination of the current rating system used in mobile app stores,” IEEE Software, vol. PP, 2015.

[198] S. Mokarizadeh, M. T. Rahman, and M. Matskin, “Mining and analysis of apps in Google Play,” in Web Information Systems and Technologies - 9th International Conference, WEBIST ’13, 2013.

[199] J. E. Montandon, H. Borges, D. Felix, and M. T. Valente, “Documenting APIs with examples: Lessons learned with the APIMiner platform,” in 20th Working Conference on Reverse Engineering (WCRE). IEEE, 2013, pp. 401–408.

[200] K. Moran, M. Linares-Vásquez, C. Bernal-Cárdenas, and D. Poshyvanyk, “FUSION: A tool for facilitating and augmenting Android bug reporting,” in Proceedings of the 38th International Conference on Software Engineering, ICSE ’16. ACM, 2016, pp. 609–612.

[201] K. Moran, M. Linares-Vásquez, C. Bernal-Cárdenas, and D. Poshyvanyk, “Auto-completing bug reports for Android applications,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015, pp. 673–686.

[202] F. Morstatter, J. Pfeffer, H. Liu, and K. M. Carley, “Is the sample good enough? comparing data from twitter’s streaming API with twitter’s firehose,” CoRR, vol. abs/1306.5204, 2013.

[203] P. Mutchler, A. Doupé, J. Mitchell, C. Kruegel, and G. Vigna, “A Large-Scale Study of Mobile Web App Security,” in Proceedings of the Mobile Security Technologies Workshop (MoST), 2015.

[204] A. Möller, S. Diewald, L. Roalter, T. U. München, F. Michahelles, and M. Kranz, “Update behavior in app markets and security implications: A case study in Google Play,” in Proc. of the 3rd Intl. Workshop on Research in the Large, held in conjunction with Mobile HCI, 2012, pp. 3–6.

[205] M. Nayebi, B. Adams, and G. Ruhe, “Mobile app releases – a survey research on developers and users perception,” in IEEE 23rd International Conference on Software Analysis, Evolution and Reengineering (SANER’16). IEEE, 2016.

[206] M. Nayebi, B. Adams, and G. Ruhe, “Release practices in mobile apps – users and developers perception,” in Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 552–562.

[207] M. Nayebi and G. Ruhe, “Trade-off service portfolio planning–a case study on mining the Android app market,” PeerJ PrePrints, 2015.

[208] G. Neumann, M. Harman, and S. M. Poulding, “Transformed vargha-delaney effect size,” in Search-Based Software Engineering - 7th International Symposium, SSBSE, 2015, pp. 318–324.

[209] D. Newman, A. Asuncion, P. Smyth, and M. Welling, “Distributed algorithms for topic models,” J. Mach. Learn. Res., vol. 10, pp. 1801–1828, 2009.

[210] Y. Y. Ng, H. Zhou, Z. Ji, H. Luo, and Y. Dong, “Which Android app store can be trusted in China?” in Proceedings of the 2014 IEEE 38th Annual Computer Software and Applications Conference, COMPSAC ’14. IEEE Computer Society, 2014, pp. 509–518.

[211] Object Refinery Limited, “JFreeChart,” http://www.jfree.org/jfreechart/, 2014.

[212] J. Oh, D. Kim, U. Lee, J.-G. Lee, and J. Song, “Facilitating developer-user interactions with mobile app review digests,” in CHI ’13 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’13. ACM, 2013, pp. 1809–1814.

[213] Oracle, “Java SE — Oracle Technology Network — Oracle,” http://www.oracle.com/technetwork/java/javase/overview/index.html, 2016.

[214] D. Pagano and W. Maalej, “User feedback in the appstore: An empirical study,” in Proceedings of the 21st IEEE International Requirements Engineering Conference (RE) ’13. IEEE, 2013.

[215] F. Palomba, M. Linares-Vásquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, “User reviews matter! tracking crowdsourced reviews to support evolution of successful apps,” in 31st International Conference on Software Maintenance and Evolution, ICSME ’15, 2015.

[216] R. Pandita, X. Xiao, W. Yang, W. Enck, and T. Xie, “Whyper: Towards automating risk assessment of mobile applications,” in Proceedings of the 22nd USENIX Conference on Security, SEC’13. USENIX Association, 2013, pp. 527–542.

[217] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, “How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms,” in Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13. IEEE Press, 2013, pp. 522–531.

[218] S. Panichella, A. Di Sorbo, E. Guzman, A. Visaggio, G. Canfora, and H. Gall, “How can I improve my app? Classifying user reviews for software maintenance and evolution,” in 31st IEEE International Conference on Software Maintenance and Evolution, 2015.

[219] D. H. Park, M. Liu, C. Zhai, and H. Wang, “Leveraging user reviews to improve accuracy for mobile app retrieval,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15. ACM, 2015, pp. 533–542.

[220] N. Peiravian and X. Zhu, “Machine learning for Android malware detection using permission and API calls,” in Proceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, ICTAI ’13. IEEE Computer Society, 2013, pp. 300–305.

[221] H. Peng, C. S. Gates, B. P. Sarma, N. Li, Y. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy, “Using probabilistic generative models for ranking risks of Android apps,” in ACM Conference on Computer and Communications Security. ACM, 2012, pp. 241–252.

[222] M. Peters, “Nokia WidSets for mobile phones,” http://www.letsgomobile.org/en/3465/nokia-widsets/, 2008.

[223] T. Petsas, A. Papadogiannakis, M. Polychronakis, E. P. Markatos, and T. Karagiannis, “Rise of the planet of the apps: A systematic study of the mobile app ecosystem,” in Proceedings of the 2013 Conference on Internet Measurement Conference, IMC ’13. ACM, 2013, pp. 277–290.

[224] X.-H. Phan and C.-T. Nguyen, “GibbsLDA++,” http://gibbslda.sourceforge.net, retrieved 3rd March 2016.

[225] X.-H. Phan and C.-T. Nguyen, “Jgibblda,” http://jgibblda.sourceforge.net, retrieved 3rd March 2016.

[226] R. Potharaju, A. Newell, C. Nita-Rotaru, and X. Zhang, “Plagiarizing smartphone applications: Attack strategies and defense techniques,” in Proceedings of the 4th International Conference on Engineering Secure Software and Systems (ESSoS’12). Springer-Verlag, 2012, pp. 106–120.

[227] T. Preuss, “Mobile applications, function points and cost estimating,” in International Cost Estimation & Analysis Association Conference, 2013.

[228] T. Preuss, “Mobile applications, functional analysis, and the customer experience,” in The IFPUG Guide to IT and Software Measurement, IFPUG, Ed. Auerbach Publications, 2012, pp. 408–433.

[229] Python Software Foundation, “20.5. urllib – Open arbitrary resources by URL – Python 2.7.12 documentation,” https://docs.python.org/2/library/urllib.html, 2016.

[230] Python Software Foundation, “spynner 2.19 : Python Package Index,” https://pypi.python.org/pypi/spynner, 2016.

[231] Z. Qu, V. Rastogi, X. Zhang, Y. Chen, T. Zhu, and Z. Chen, “AutoCog: Measuring the description-to-permission fidelity in Android applications,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1354–1365.

[232] V. Rastogi, Y. Chen, and W. Enck, “Appsplayground: automatic security analysis of smartphone applications,” in Proceedings of the third ACM conference on Data and application security and privacy. ACM, 2013, pp. 209–220.

[233] L. Ravindranath, S. Nath, J. Padhye, and H. Balakrishnan, “Automatic and scalable fault detection for mobile applications,” in Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys ’14. ACM, 2014, pp. 190–203.

[234] A.-D. Rein and J. Münch, “Feature prioritization based on mock-purchase: A mobile case study,” in Proceedings of the Lean Enterprise Software and Systems Conference (LESS 2013). Springer, 2013, pp. 165–179.

[235] S. Robertson, “Understanding inverse document frequency: On theoretical arguments for IDF,” Journal of Documentation, vol. 60, 2004.

[236] S. Roy, J. DeLoach, Y. Li, N. Herndon, D. Caragea, X. Ou, V. P. Ranganath, H. Li, and N. Guevara, “Experimental study with real-world data for Android app security analysis using machine learning,” in Proceedings of the 31st Annual Computer Security Applications Conference, ACSAC 2015. ACM, 2015, pp. 81–90.

[237] I. J. M. Ruiz, B. Adams, M. Nagappan, S. Dienst, T. Berger, and A. E. Hassan, “A large scale empirical study on software reuse in mobile apps,” IEEE Software, vol. 31, no. 2, pp. 78–86, 2014.

[238] I. J. M. Ruiz, M. Nagappan, B. Adams, T. Berger, S. Dienst, and A. E. Hassan, “Impact of ad libraries on ratings of Android mobile apps,” IEEE Software, vol. 31, no. 6, pp. 86–92, 2014.

[239] I. J. M. Ruiz, M. Nagappan, B. Adams, T. Berger, S. Dienst, and A. E. Hassan, “Analyzing ad library updates in Android apps,” IEEE Software, vol. 33, no. 2, pp. 74–80, 2016.

[240] I. J. M. Ruiz, M. Nagappan, B. Adams, and A. E. Hassan, “Understanding reuse in the Android market,” in 20th IEEE International Conference on Program Comprehension, ICPC ’12, 2012, pp. 113–122.

[241] A. Sahami Shirazi, N. Henze, A. Schmidt, R. Goldberg, B. Schmidt, and H. Schmauder, “Insights into layout patterns of mobile user interfaces by an automatic analysis of Android apps,” in Proceedings of the 5th ACM SIGCHI Symposium on Engineering Interactive Computing Systems, EICS ’13. ACM, 2013, pp. 275–284.

[242] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, and P. G. Bringas, “On the automatic categorisation of Android applications,” in CCNC’12, 2012, pp. 149–153.

[243] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, J. Nieves, P. G. Bringas, and G. Álvarez, “MAMA: Manifest analysis for malware detection in Android,” Cybernetics and Systems - Intelligent Network Security and Survivability, vol. 44, no. 6-7, pp. 469–488, 2013.

[244] B. Sanz, I. Santos, J. Nieves, C. Laorden, I. Alonso-Gonzalez, and P. G. Bringas, “Mads: malicious Android applications detection through string analysis,” in Network and System Security. Springer, 2013, pp. 178–191.

[245] B. Sanz, I. Santos, X. Ugarte-Pedrero, C. Laorden, J. Nieves, and P. G. Bringas, “Instance-based anomaly method for Android malware detection,” in Proceedings of the 10th International Conference on Security and Cryptography, SECRYPT ’13, 2013, pp. 387–394.

[246] F. Sarro, “The UCLAppA repository: A repository of research articles on mobile software engineering and app store analysis,” http://www0.cs.ucl.ac.uk/staff/F.Sarro/projects/UCLappA/UCLappArepository.html.

[247] F. Sarro, A. Al-Subaihin, M. Harman, Y. Jia, W. Martin, and Y. Zhang, “Feature lifecycles as they spread, migrate, remain and die in app stores,” in Proceedings of the Requirements Engineering Conference, 23rd IEEE International (RE’15). IEEE, 2015.

[248] J. Schütte, R. Fedler, and D. Titze, “ConDroid: Targeted dynamic analysis of Android applications,” in 29th IEEE International Conference on Advanced Information Networking and Applications, AINA ’15, 2015, pp. 571–578.

[249] S. L. Scott and H. R. Varian, “Predicting the present with bayesian structural time series,” International Journal of Mathematical Modelling and Numerical Optimisation, vol. 5, no. 1-2, pp. 4–23, 2014.

[250] S. Seneviratne, H. Kolamunna, and A. Seneviratne, “A measurement study of tracking in paid mobile applications,” in Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks, WiSec ’15. ACM, 2015, pp. 7:1–7:6.

[251] S. Seneviratne, A. Seneviratne, M. A. Kaafar, A. Mahanti, and P. Mohapatra, “Early detection of spam mobile apps,” in Proceedings of the 24th International Conference on World Wide Web, WWW ’15. International World Wide Web Conferences Steering Committee, 2015, pp. 949–959.

[252] G. Sethumadhavan, “Sizing Android mobile applications,” 2011, presentation at 6th IFPUG International Software Measurement and Analysis Conference, ISMA ’11.

[253] A. Shabtai, Y. Fledel, and Y. Elovici, “Automated static code analysis for classifying Android applications using machine learning,” in Proceedings of the 2010 International Conference on Computational Intelligence and Security, CIS ’10. IEEE Computer Society, 2010, pp. 329–333.

[254] C. Sharma, “Sizing up the global mobile apps market,” Report, Chetan Sharma Consulting, Issaquah, WA, 2010.

[255] C. Shuler, “iLearnII; an analysis of the education category of the iTunes App Store,” The Joan Ganz Cooney Center at Sesame Workshop, 2012.

[256] C. E. Spearman, “The proof and measurement of association between two things,” The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.

[257] B. Staff, “Mobile marketing statistics compilation,” http://www.bostoncommons.net/smartphone-adoption-in-developing-world/, 2015.

[258] R. Stevens, J. Ganz, V. Filkov, P. Devanbu, and H. Chen, “Asking for (and about) permissions used by Android apps,” in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13. IEEE Press, 2013, pp. 31–40.

[259] Z. Svedic, “The effect of informational signals on mobile apps sales ranks across the globe,” Ph.D. dissertation, Simon Fraser University, 2015.

[260] M. D. Syer, B. Adams, Y. Zou, and A. E. Hassan, “Exploring the development of micro-apps: A case study on the BlackBerry and Android platforms,” in Proceedings of the 2011 IEEE 11th International Working Conference on Source Code Analysis and Manipulation, SCAM ’11. IEEE Computer Society, 2011, pp. 55–64.

[261] M. D. Syer, M. Nagappan, B. Adams, and A. E. Hassan, “Studying the relationship between source code quality and mobile platform dependence,” Software Quality Journal, vol. 23, no. 3, pp. 485–508, 2015.

[262] M. D. Syer, M. Nagappan, A. E. Hassan, and B. Adams, “Revisiting prior empirical findings for mobile apps: An empirical case study on the 15 most popular open-source Android apps,” in Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research, CASCON ’13. IBM Corp., 2013, pp. 283–297.

[263] J. Tan, K. Nguyen, M. Theodorides, H. Negrón-Arroyo, C. Thompson, S. Egelman, and D. Wagner, “The effect of developer-specified explanations for permission requests on smartphone user behavior,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2014, pp. 91–100.

[264] P. Teufl, M. Ferk, A. Fitzek, D. Hein, S. Kraxberger, and C. Orthacker, “Malware detection by applying knowledge discovery processes to application metadata on the Android Market (Google Play),” Security and Communication Networks, 2013.

[265] P. Teufl, S. Kraxberger, C. Orthacker, G. Lackner, M. Gissing, A. Marsalek, J. Leibetseder, and O. Prevenhueber, “Android Market analysis with activation patterns,” in Security and Privacy in Mobile Information and Communication Systems (MobiSec), Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. Springer Berlin Heidelberg, 2012, vol. 94, pp. 1–12.

[266] The Apache Software Foundation, “Apache Tomcat – Welcome!” http://tomcat.apache.org/, 2016.

[267] The Apache Software Foundation, “Math Commons Math: The Apache Commons Mathematics Library,” http://commons.apache.org/proper/commons-math/, 2016.

[268] The Apache Software Foundation, “Maven – Welcome to Apache Maven,” https://maven.apache.org/, 2016.

[269] Y. Tian, M. Nagappan, D. Lo, and A. E. Hassan, “What are the characteristics of high-rated apps? a case study on free Android applications,” in 31st International Conference on Software Maintenance and Evolution, ICSME ’15, 2015, pp. 1–10.

[270] Y.-X. Tong, J. She, and L. Chen, “Towards better understanding of app functions,” Journal of Computer Science and Technology, vol. 30, no. 5, pp. 1130–1140, 2015.

[271] S. Vakulenko, O. Müller, and J. v. Brocke, “Enriching iTunes App Store categories via topic modeling,” in International Conference on Information Systems (ICIS’14), 2014.

[272] H. van Heeringen and E. Van Gorp, “Measure the functional size of a mobile app: Using the cosmic functional size measurement method,” in 2014 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA). IEEE, 2014, pp. 11–16.

[273] A. Vargha and H. D. Delaney, “A critique and improvement of the “CL” common language effect size statistics of McGraw and Wong,” Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000.

[274] R. Vasa, L. Hoon, K. Mouzakis, and A. Noguchi, “A preliminary analysis of mobile app user reviews,” in Proceedings of the 24th Australian Computer-Human Interaction Conference, OzCHI ’12. ACM, 2012, pp. 241–244.

[275] N. Viennot, “GitHub - nviennot/playdrone: Google Play Crawler,” https://github.com/nviennot/playdrone, 2014.

[276] N. Viennot, E. Garcia, and J. Nieh, “A measurement study of Google Play,” in The 2014 ACM international conference on Measurement and modeling of computer systems, SIGMETRICS ’14. ACM, 2014, pp. 221–233.

[277] L. Vigneri, J. Chandrashekar, I. Pefkianakis, and O. Heen, “Taming the Android appstore: Lightweight characterization of Android applications,” CoRR, vol. abs/1504.06093, 2015.

[278] L. Villarroel Pérez, “Mining mobile apps reviews to support release planning,” Master’s thesis, ETSI Informática, 2015.

[279] VirusShare, “VirusShare.com,” http://virusshare.com/, 2011.

[280] Vision Mobile, “Developer Economics 2013: The tools report,” http://www.visionmobile.com/product/developer-economics-2013-the-tools-report/, 2013.

[281] Vision Mobile, “Developer Economics Q1 2015: State of the Developer Nation,” http://www.visionmobile.com/product/developer-economics-q1-2015-state-developer-nation/, 2015.

[282] A. von Rhein, T. Berger, N. S. Johansson, M. M. Hardø, and S. Apel, “Lifting inter-app data-flow analysis to large app sets,” Fakultät für Informatik und Mathematik, Universität Passau, Tech. Rep., 2015.

[283] P. M. Vu, T. T. Nguyen, H. V. Pham, and T. T. Nguyen, “Mining user opinions in mobile app reviews: A keyword-based approach,” in Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE ’15). IEEE, 2015, pp. 749–759.

[284] P. M. Vu, H. V. Pham, T. T. Nguyen, and T. T. Nguyen, “Tool support for analyzing mobile app reviews,” in Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE ’15). IEEE, 2015, pp. 789–794.

[285] W. Maalej, M. Nayebi, T. Johann, and G. Ruhe, “Toward data-driven requirements engineering,” IEEE Software Jan/Feb 2016: Special Issue on the Future of Software Engineering, 2016, to appear.

[286] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno, “Evaluation methods for topic models,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 1105–1112.

[287] M. Wan, Y. Jin, D. Li, and W. G. Halfond, “Detecting display energy hotspots in Android apps,” in Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International Conference on. IEEE, 2015, pp. 1–10.

[288] H. Wang, Y. Guo, Z. Ma, and X. Chen, “WuKong: A scalable and accurate two-phase approach to Android app clone detection,” in Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015. ACM, 2015, pp. 71–82.

[289] H. Wang, J. Hong, and Y. Guo, “Using text mining to infer the purpose of permission use in mobile apps,” in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’15. ACM, 2015, pp. 1107–1118.

[290] W. Wang, X. Wang, D. Feng, J. Liu, Z. Han, and X. Zhang, “Exploring permission-induced risk in Android applications for malicious application detection,” IEEE Transactions on Information Forensics and Security, pp. 1869–1882, 2014.

[291] Y. Wang, J. Zheng, C. Sun, and S. Mukkamala, “Quantitative security risk assessment of Android permissions and applications,” in Proceedings of the 27th International Conference on Data and Applications Security and Privacy XXVII, DBSec’13. Springer-Verlag, 2013, pp. 226–241.

[292] M. Wano and J. Iio, “Relationship between reviews at app store and the categories for software,” in Proceedings of the 2014 17th International Conference on Network-Based Information Systems, NBIS ’14. IEEE Computer Society, 2014, pp. 580–583.

[293] T. Watanabe, M. Akiyama, T. Sakai, H. Washizaki, and T. Mori, “Understanding the inconsistencies between text descriptions and the use of privacy-sensitive resources of mobile apps,” in Eleventh Symposium On Usable Privacy and Security (SOUPS 2015). USENIX Association, 2015.

[294] S. Wenxuan and Y. Airu, “Interoperability-enriched app recommendation,” in Data Mining Workshop (ICDMW), 2014 IEEE International Conference on. IEEE, 2014, pp. 1242–1245.

[295] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

[296] C. Wohlin, “Guidelines for snowballing in systematic literature studies and a replication in software engineering,” in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, EASE ’14. ACM, 2014, pp. 38:1–38:10.

[297] Z. Xie and S. Zhu, “AppWatcher: Unveiling the underground market of trading mobile app reviews,” in Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks, WiSec ’15. ACM, 2015, pp. 10:1–10:11.

[298] W. Xu, F. Zhang, and S. Zhu, “Permlyzer: Analyzing permission usage in Android applications,” in 24th IEEE International Symposium on Software Reliability Engineering, ISSRE ’13. IEEE, 2013, pp. 400–410.

[299] W. Yang, X. Xiao, B. Andow, S. Li, T. Xie, and W. Enck, “AppContext: Differentiating malicious and benign mobile app behaviors using context,” in Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, 2015, pp. 303–313.

[300] Y. Yang, J. Stella Sun, and M. W. Berry, “APPIC: Finding the hidden scene behind description files for Android apps,” Dept. of Electrical Engineering and Computer Science University of Tennessee, Tech. Rep., 2014.

[301] yannrichet, “GitHub – yannrichet/jmathplot: Java interactive 2D and 3D plots (no OpenGL),” https://github.com/yannrichet/jmathplot, 2016.

[302] P. Yin, P. Luo, W.-C. Lee, and M. Wang, “App recommendation: A contest between satisfaction and temptation,” in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13. ACM, 2013, pp. 395–404.

[303] F. Zhang, H. Huang, S. Zhu, D. Wu, and P. Liu, “Viewdroid: Towards obfuscation-resilient mobile application repackaging detection,” in Proceedings of the 7th ACM Conference on Security and Privacy in Wireless & Mobile Networks, WiSec ’14, 2014, pp. 25–36.

[304] H. Zhang, L. Gong, and S. Versteeg, “Predicting bug-fixing time: An empirical study of commercial software projects,” in Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13. IEEE Press, 2013, pp. 1042–1051.

[305] M. Zhang, Y. Duan, Q. Feng, and H. Yin, “Towards automatic generation of security-centric descriptions for Android apps,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15. ACM, 2015, pp. 518–529.

[306] P. Zheng, Y. Zhou, M. R. Lyu, and Y. Qi, “Granger causality-aware prediction and diagnosis of software degradation,” in IEEE International Conference on Services Computing (SCC’14), 2014, pp. 528–535.

[307] N. Zhong and F. Michahelles, “Google Play is not a long tail market: An empirical analysis of app adoption on the Google Play app market,” in Proceedings of the 28th Annual ACM Symposium on Applied Computing, SAC ’13. ACM, 2013, pp. 499–504.

[308] Y. Zhou and X. Jiang, “Dissecting Android malware: Characterization and evolution,” in Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 2012, pp. 95–109.

[309] Y. Zhou, L. Wu, Z. Wang, and X. Jiang, “Harvesting developer credentials in Android apps,” in Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks, WiSec ’15. ACM, 2015, pp. 23:1–23:12.

[310] H. Zhu, C. Liu, Y. Ge, H. Xiong, and E. Chen, “Popularity modeling for mobile apps: A sequential approach,” IEEE Transactions on Cybernetics, vol. 45, no. 7, pp. 1303–1314, 2014.

[311] H. Zhu, H. Cao, E. Chen, H. Xiong, and J. Tian, “Exploiting enriched contextual information for mobile app classification,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 2012, pp. 1617–1621.

[312] H. Zhu, E. Chen, H. Xiong, H. Cao, and J. Tian, “Mobile app classification with enriched contextual information,” IEEE Transactions on Mobile Computing, vol. 13, no. 7, pp. 1550–1563, 2014.

[313] H. Zhu, H. Xiong, Y. Ge, and E. Chen, “Ranking fraud detection for mobile apps: A holistic view,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM ’13. ACM, 2013, pp. 619–628.

[314] H. Zhu, H. Xiong, Y. Ge, and E. Chen, “Mobile app recommendations with security and privacy awareness,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14. ACM, 2014, pp. 951–960.

[315] H. Zhu, H. Xiong, Y. Ge, and E. Chen, “Discovery of ranking fraud for mobile apps,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 1, pp. 74–87, 2015.

[316] J. Zhu, Z. Guan, Y. Yang, L. Yu, H. Sun, and Z. Chen, “Permission-based abnormal application detection for Android,” in Proceedings of the 14th International Conference on Information and Communications Security, ICICS’12. Springer-Verlag, 2012, pp. 228–239.

Appendix A

Closely Related Literature

The following literature is important to the field of App Store Analysis, yet itself does not meet our exact definition of App Store Analysis. Nevertheless, since this work meets aspects of the definition, we regard it as closely related. We do not claim to comprehensively survey this literature, but provide it to add context to the App Store Analysis literature discussed in Chapter 2.

A.1 User Surveys and Studies

A cross section of app analysis studies survey or study user behaviour and feedback, but the information gathered is not specific to observed apps and is therefore not combined with technical information. These studies are nonetheless important to the field of App Store Analysis and so are included here.

In 2011 Böhmer et al. [33] studied 4,100 Android users for app usage statistics. This was done using AppSensor, an application that monitors the usage of other apps on an Android device. They found that the average application usage session was less than 72 seconds long, and that smartphones were used for almost 60 minutes every day. The type of application used was found to differ between times of day, such as news applications in the morning and games at night; the exception to this rule was communication apps, which were used throughout the day.

In 2012 Ferreira et al. [76] surveyed the charging habits of 4,035 Android users, using an app to record their behaviour. Lin et al. [160] conducted a survey of 179 Android users that asked about their expectations of the purpose and sensitive data handling of apps. They found that the problem of apps not meeting expectations, or utilising sensitive data unexpectedly, was prevalent, and outlined potential store interface changes to rectify the issue. Rein and Münch [234] carried out a user study involving mock purchasing of planned app features, in order to determine both the priority and ideal pricing for the features.

In 2013 Oh et al. [212] surveyed 100 app users and found that users were more likely to take a passive approach and delete apps rather than review them or contact developers; when users did take an active approach, reviewing was the most popular option. In 2014 Tan et al. [263] surveyed users and developers of the Apple App Store regarding the iOS permission request explanation feature. The feature was infrequently used, but the survey found that users would be significantly more likely to accept a permissions request if an explanation was given.

In 2015 Lim et al. [155] surveyed app users from 15 countries to understand how usage of apps and app stores differs by region. They found that behaviour differed significantly by region in many regards. In Eastern regions such as China, a greater proportion of users participated in recommendation and rating of apps, almost 4 times the proportion of Western users. Additionally, the survey found that app abandonment due to issues was higher on average in some countries, including the UK, and lower in others, indicating that any differences were affected by more than global region. It is a unique study as it gathered user information regarding multiple app stores across a large number of global app users: the focus is on usage, not on apps, yet the authors identified actionable findings for app developers.

A.2 Related Security

I present some of the key app security studies that do not perform App Store Analysis, but that influenced some of the papers described in Section 2.10.

Enck et al. [74] introduced Kirin, an Android app certification tool for flagging potential malware using a set of rules. In 2010 Enck et al. introduced TaintDroid [73], a tool for tracking the flow of sensitive information within an Android app. TaintDroid was one of the first information flow analysis tools for Android and was built on extensively in subsequent work. Another information flow extraction tool, FlowDroid, was created by Arzt et al. [11] in 2014; this tool statically analyses information flow to find all possible flows.

Some authors have used sets mined from Google Play as benign app sets to test against known malware: Xu et al. [298], Rastogi et al. [232], Jing et al. [132], Arp et al. [10], Wang et al. [290], Liu and Liu [171], Roy et al. [236] and Khanmohammadi et al. [143]. Ho et al. [115] used the top 10 most popular apps in each category as a benign set upon which to test their framework for rootkit exploit containment. Other authors have used sets mined from app stores to test their tools on large real-world datasets: Barrera et al. [18], Jeon et al. [128], Grace et al. [98], Crussell et al. [63, 64], Ravindranath et al. [233], von Rhein et al. [282], Li et al. [153], Huang et al. [123], Cen et al. [43], Liu et al. [169] and Bastani et al. [20].

A.3 Reports

Initial studies such as the 2010 work by Sharma [254] evaluated the size and growth of the apps market up to that time. In 2011 Butler [38] conducted a study on the Android system, highlighting how it was changing mobile development by enabling people with no prior development experience to release an app.

In 2012 Shuler [255] published a report on the Apple App Store Education category, comparing it with their previous study in 2009. They found that over 72% of the top-selling apps in this category targeted children aged 10 or below, a number that had significantly increased from 47% in 2009. Additionally, the average price of an app had risen by 1 USD since 2009, and the majority of top Education developers in 2012 had not been present in 2009.

The 2013 report by Vision Mobile [280] on app industry monetary value and growth found that 72% of developers were dedicated to Android. iOS and Android developers earned on average double that of developers for other platforms, and iOS was considered the highest priority platform. As of 2013, iOS, Android and BlackBerry were the leading platforms, despite BlackBerry's decline and the launch of the prospective Windows Phone Store in late 2010. Vision Mobile have released yearly reports since 2012 on aspects such as developer share, industry revenue and growth. The organisation gathers information by surveying developers worldwide.

A.4 Mining Tools

Due to the plethora of analysis and research opportunities presented by app store data, and also due to the difficulties involved with mining app stores, several mining tools have been published. In 2013, Bakar et al. [17] published OSSGrab, which mines HTML pages from Google Play; the tool was built in order to facilitate their app permissions study [15]. In 2014, Viennot et al. introduced the PlayDrone Google Play crawler [275], to facilitate their large-scale API study [276]. The Android Malware Genome Project [308] is a popular source of malware applications for testing security tools. In 2015 Krutz et al. [149] made available a dataset containing 1,179 open source applications.
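To give a concrete sense of the scraping such crawlers perform, the sketch below downloads an app's Google Play page and pulls out its HTML title. It is only a hedged illustration of store-page mining in general, not the implementation of OSSGrab or PlayDrone; the URL pattern, the example package name and the regular expression are assumptions that would need adjusting whenever the store's HTML changes.

# Minimal sketch of app store page mining (illustrative only; not OSSGrab or PlayDrone).
# The URL format, example package name and regex are assumptions and will break as the
# store's HTML evolves.
import re
import urllib.request

def fetch_app_page(package_name):
    """Download the raw HTML of an app's Google Play page."""
    url = "https://play.google.com/store/apps/details?id=" + package_name
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_title(html):
    """Use the page <title> as a crude proxy for the app name."""
    match = re.search(r"<title>(.*?)</title>", html, re.DOTALL)
    return match.group(1).strip() if match else None

if __name__ == "__main__":
    html = fetch_app_page("com.example.app")  # hypothetical package name
    print(extract_title(html))

In practice such tools add rate limiting, retries and parsing of many more fields (category, price, rating, review counts), and store the results for later analysis.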

Appendix B

App Feature Extraction using Topic Models

Here we include supplementary result graphs for the Chapter 4 study on app feature extraction using topic models.
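As a reminder of the kind of pipeline behind these graphs, the following is a minimal, hedged sketch of LDA-based topic extraction over a handful of toy app descriptions, using the third-party gensim library. The thesis's own extraction used the GibbsLDA++/JGibbLDA implementations cited in the bibliography; the corpus, tokenisation and topic count below are illustrative assumptions, not the Chapter 4 configuration.

# Illustrative LDA topic extraction over toy app descriptions; not the thesis's
# GibbsLDA++/JGibbLDA pipeline. Requires the third-party gensim package.
from gensim import corpora, models

descriptions = [  # hypothetical app descriptions
    "track your daily running route with gps and share workouts",
    "scan barcodes and compare prices while shopping in store",
    "edit photos with filters and stickers and share to social networks",
    "plan running workouts and log gps routes offline",
]

texts = [d.lower().split() for d in descriptions]      # naive tokenisation
dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]        # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=0)

# Print the five most probable words for each extracted topic.
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])

The topic-to-word distributions printed at the end are the kind of output used as extracted features in Chapter 4.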

B.1 Sliding Window Rating/Download Correlation Analysis Results

Figure B.1: Sliding Window Rating/Download Correlation Analysis results

[Figure B.1 comprises one pair of plots per category and price type: Spearman's rho and p-value (left) and the total number of apps (right), each plotted against the minimum number of reviews (0 to 9). Panels are shown for All categories, Business, Education, Entertainment, Finance, Games, Health & Wellness, Instant Messaging & Social Networking, Maps & Navigation, Music & Audio, News, Photo & Video, Productivity, Reference & eBooks, Shopping, Sports & Recreation, Themes, Travel, Utilities and Weather, separately for free and paid apps.]
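For orientation, the sketch below mirrors the general shape of the analysis suggested by the plot axes: sweep a minimum-review threshold from 0 to 9, keep only the apps that meet it, and at each step compute Spearman's rank correlation (rho and p-value) between rating and downloads together with the number of surviving apps. The record fields, the toy data and the use of scipy are assumptions for illustration rather than the thesis's implementation.

# Hedged sketch of a sliding minimum-review-threshold correlation analysis, shaped
# after the axes of Figure B.1; the data and field layout are hypothetical.
from scipy.stats import spearmanr

apps = [  # hypothetical (reviews, rating, downloads) records
    (0, 3.5, 1000), (2, 4.0, 5000), (5, 4.5, 20000),
    (9, 3.0, 800), (12, 4.2, 45000), (30, 4.8, 120000),
]

for min_reviews in range(0, 10):
    kept = [(rating, downloads) for reviews, rating, downloads in apps
            if reviews >= min_reviews]
    if len(kept) < 3:  # too few apps left to correlate meaningfully
        break
    ratings, downloads = zip(*kept)
    rho, p = spearmanr(ratings, downloads)
    print(min_reviews, len(kept), round(rho, 3), round(p, 3))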

Appendix C

What are the Features that are Added to Impactful Releases?

C.1 Supplementary Category-Specific Topic Results

Here we include supplementary results for the Section 7.5 study on features that are specific to impactful releases.

Table C.1: RQ1 + RQ2: Category topics that changed at release time. NI: non-impact releases; I: impactful releases; N: releases that decreased rating; P: releases that increased rating. Top: proportion of topics that were added in I/P releases, and taken away or not added to NI/N releases. Bottom: proportion of topics that were removed or not changed in I/P releases, but were added in NI/N releases.

NI I Topic N P Topic -0.14 0.76 review help really like nice -1.33 0.00 copy paste clipboard feature de- sign -0.85 0.00 import export file csv html -1.33 0.00 improved access feature early time -0.28 0.51 ajout plus mail de pour 0.00 1.20 android wear watch smartwatch support 0.28 -0.51 de le pour correction vous 0.44 -0.60 image gallery preview load crop 0.28 -0.76 copy paste clipboard feature de- 4.42 2.40 permission read used storage sign external 3.84 1.78 bible verse reading chapter re- 3.10 0.00 bible verse reading chapter re- move move Books & Reference

NI I Topic N P Topic -2.63 3.85 expense report transaction in- -9.09 0.00 orientation landscape portrait come bill tablet mode 0.00 3.85 contact list phone using copy -9.09 0.00 display option screen title de- fault 0.00 3.85 message notification send send- -9.09 0.00 tablet phone available mobile ing receive optimised 1.32 0.00 device range mode wider ex- 0.00 6.67 log change multiple list high- tended light 2.63 0.00 task reminder time set view 9.09 0.00 expense report transaction in- come bill 2.63 0.00 bug fix minor space file 9.09 0.00 backup restore cloud Business C.1. Supplementary Category-Specific Topic Results 224

NI I Topic N P Topic 0.00 2.78 tablet phone available mobile -2.86 2.70 book reading comic page reader optimised -0.54 1.39 document pdf text multiple file -2.86 2.70 download downloads file down- loaded downloading -0.54 1.39 improved handling handle -2.86 0.00 setting layer color menu back- startup performance ground 0.54 0.00 close force fixed caused compat- 2.86 0.00 document pdf text multiple file ible 1.63 0.00 faster load loading scrolling 2.86 0.00 improved handling handle speed startup performance 0.54 -1.39 news feed article section read 5.71 0.00 tablet phone available mobile optimised Comics

NI I Topic N P Topic -0.16 1.22 new feature create creating geo- -2.06 0.00 feature allows requested an- tag nounce excited -0.48 0.82 device work -2.06 0.00 feedback suggestion comment mobile send app -0.32 0.82 chinese language simplified tra- -1.03 0.68 email send comment request ditional japanese suggestion 0.16 -0.82 menu button setting section par- 2.06 0.00 free feel email contact app ent 0.16 -0.82 feedback suggestion comment 2.06 0.00 android lollipop compatibility send app kitkat pre 0.32 -0.82 card credit payment account 2.06 -0.68 language spanish french german transaction portuguese Communication

NI I Topic N P Topic -0.64 1.82 bug fix minor doe symptom 0.00 12.90 weve easier making squashed feedback -0.64 1.82 language franais deutsch en- -8.33 -3.23 word dictionary letter day of- glish meeting fline 0.00 1.82 permission read used storage -4.17 0.00 building build city quest decora- external tion 0.64 -1.82 fish fishing scene new imple- 4.17 0.00 bug fix minor doe symptom mented 0.64 -1.82 building build city quest decora- 4.17 0.00 let know review store like tion 1.27 -3.64 weve youll just got thing 4.17 0.00 brush color pen drawing paint Education

NI I Topic N P Topic -0.28 0.63 baby fun accessory make clothes 0.00 1.20 key keyboard input layout press -0.14 0.63 free user chaatz number india 0.00 1.20 free user chaatz number india 0.00 0.63 removed permission unneces- -0.65 0.00 android wear watch smartwatch sary needed unused support 0.42 -0.31 support stream beta 0.65 0.00 resolution high device higher video screen 0.28 -0.63 notification push receive alert 0.65 -0.60 spider man including event receiving power 0.55 -0.94 movie show content browse film 1.31 0.00 removed permission unneces- sary needed unused Entertainment C.1. Supplementary Category-Specific Topic Results 225

NI I Topic N P Topic -0.48 1.41 expense report transaction in- -0.68 1.44 app feedback release improve come bill experience -0.36 0.56 version til med role fejlrettelser -0.68 0.48 application launch implemented launching program -0.24 0.56 support additional functionality -0.68 0.48 app includes using version bet- tablet operation ter 0.24 -0.28 update future big near hand 0.68 -0.96 calculator tax rate calculation loan 0.12 -0.56 aplikacji bdw poprawki dla wer- 1.37 -0.48 mobile app customer banking sji service 0.24 -0.56 offer app working continuously 0.68 -1.44 aplikacji bdw poprawki dla wer- better sji Finance

NI I Topic N P Topic -0.64 4.44 app feedback release improve -14.29 2.63 flight hotel booking app book experience -0.64 2.22 run smoother make faster con- -14.29 2.63 improved touch suit partner tent play -0.64 2.22 weve youll just got thing -14.29 2.63 game mini house run smooth 2.55 0.00 love feedback hear wed sugges- 0.00 2.63 brush color pen drawing paint tion 1.27 -2.22 hard working weve work bring 14.29 0.00 google drive cloud integration integrated 1.27 -2.22 app way need update problem 14.29 -5.26 goal tracking track progress ac- tivity Health & Fitness

NI I Topic N P Topic -0.51 1.06 support device android version 0.00 3.45 para melhorias com mais nova initial 0.00 1.06 version doubled displacement 0.00 3.45 issue fixed crashing recovery re- app randomization lated 0.00 1.06 fix bug better localization intro- 0.00 3.45 release note detail github stabil- duced ity 0.51 -1.06 refresh auto pull change map 1.54 0.00 fix bug better localization intro- duced 0.51 -1.06 language spanish french german 1.54 0.00 support device android version portuguese initial 1.02 -1.06 scan code scanning barcode 4.62 0.00 notification push receive alert scanner receiving Libraries & Demo

NI I Topic N P Topic -0.61 1.49 online lot avatar description ta- -8.33 0.00 youre looking know having peo- ble ple 0.00 1.49 article reading read save link -4.17 0.00 location current based network wont 0.00 1.49 version previous upgrading can- -4.17 0.00 version algorithm major photo celling mixcloud recovery 0.61 -2.99 youre looking know having peo- 4.17 0.00 weather widget forecast temper- ple ature current 0.61 -2.99 alarm time mode timer sound 4.17 0.00 backup restore cloud google drive 0.61 -2.99 recipe like kitchen cooking fea- 4.17 -2.33 weve youll just got thing ture Lifestyle C.1. Supplementary Category-Specific Topic Results 226

NI I Topic N P Topic -0.85 1.33 live stream watch video broad- -1.94 0.82 sticker testing line bot chat cast -0.42 0.89 card external storage support -1.94 0.82 video youtube player playing file thumbnail -0.64 0.44 language swedish turkish trans- -0.97 0.82 para versin la que mejoras lation polish 0.42 -0.44 fixed working properly premium 0.97 -0.82 fixed bug crash opening naviga- older tiondrawer 0.21 -0.89 fix bug minor extended reaction 1.94 0.00 album music playlist track artist 0.21 -0.89 add improve refine ons euro 1.94 0.00 support display click download issue Media & Video

NI I Topic N P Topic -0.24 1.71 language spanish french german -8.70 0.00 sticker testing line bot chat portuguese 0.00 0.85 addition available club function- -2.17 0.00 android device running compat- ality reference ible lollipop 0.00 0.85 support android cloud guideline 0.00 2.08 support android cloud guideline risk risk 0.72 0.00 weve youll just got thing 0.72 -1.04 stuff lot cool great check 0.48 -0.43 social share network medium 0.72 -1.04 import export file csv html sharing 0.48 -0.43 backup restore cloud google 0.72 -1.04 health improvement updated drive check doctor Medical

NI I Topic N P Topic -1.01 2.00 version cheating run easter con- -12.50 0.00 version previous upgrading can- tinue celling mixcloud -0.50 2.00 crash fixed startup causing mark -12.50 0.00 new bamboo dozen enhanced size -0.50 2.00 fast news forward slow agent 0.00 2.38 album music playlist track artist 0.50 -2.00 design material layout element 12.50 0.00 issue poker thanks playing gov- original ernor 0.50 -2.00 lock password security pin apps 12.50 0.00 page saved list bookmark search 0.50 -2.00 question answer quiz correct ask 12.50 0.00 version cheating run easter con- tinue Music & Audio

NI I Topic N P Topic 0.00 1.99 hard working weve work bring -3.61 0.00 article reading read save link 0.00 1.00 aplikacji bdw poprawki dla wer- 0.00 3.39 hard working weve work bring sji -0.30 0.50 make better easier experience -1.20 0.85 live stream watch video broad- faster cast 0.30 -0.50 love feedback hear wed sugges- 2.41 0.00 news feed article section read tion 0.30 -0.50 story directly deep feed include 2.41 0.00 aplikacji bdw poprawki dla wer- sji 0.15 -1.49 article reading read save link 2.41 -0.85 feature app latest free news News & Magazines C.1. Supplementary Category-Specific Topic Results 227

NI I Topic N P Topic -0.19 0.56 theme icon latest rate like -8.48 -0.53 facebook follow like twitter news -0.06 0.56 color background text picker -8.48 -1.05 wallpaper live muzei beautiful scheme picker -0.44 0.14 google play hangout -0.61 1.05 icon request activity wallpaper theme thank 0.31 -0.28 icon white text background set- 0.30 -0.53 resolution high device higher ting screen 0.31 -0.70 watch face wear android phone 1.21 -0.79 icon activity request launcher fixed 1.45 -4.51 wallpaper live muzei beautiful 9.09 0.53 sound thanks rate quality really picker Personalisation

NI I Topic N P Topic -0.61 2.44 facebook twitter share sharing -5.26 4.55 theme dark light instagram -0.61 2.44 android support lollipop sam- -5.26 0.00 got just weve awesome perfect sung device -0.61 2.44 list support update menu editor -5.26 0.00 look feel fresh modern refreshed 1.23 -2.44 device samsung htc nexus sup- 5.26 0.00 list support update menu editor port 1.84 -2.44 image new filter effect tool 5.26 0.00 button load save toolbar using 0.61 -4.88 mode device camera setting fo- 5.26 0.00 change minor api website visit cus Photography

NI I Topic N P Topic -0.34 2.15 language translation translate -3.85 2.99 file folder zip failure open thanks error -2.03 0.00 event calendar day agenda ex- -3.85 2.99 alarm time mode timer sound change -0.68 1.08 android support lollipop lexicon -3.85 1.49 backup restore cloud google licensing drive 0.68 -2.15 setting default changed disabled 3.85 0.00 version tool minor added wind shift 0.68 -2.15 date view widget day month 3.85 0.00 flash add widget version lan- guage 1.02 -3.23 apps device support setting 3.85 0.00 fix bug error separately device switching Productivity

NI I Topic N P Topic 0.00 20.00 product shopping page talk shop -14.29 0.00 make sure update latest thanks -2.70 4.00 day valentine love special heart 0.00 5.56 policy google privacy term ana- lytics -0.90 4.00 policy google privacy term ana- 0.00 5.56 offer app working continuously lytics better 0.90 0.00 fix available link clicking grand 28.57 16.67 product shopping page talk shop 1.80 0.00 item inventory price rare sell 14.29 0.00 und der die mit auf 1.80 -4.00 facebook twitter share sharing 14.29 0.00 sign account using single regis- instagram ter Shopping C.1. Supplementary Category-Specific Topic Results 228

NI I Topic N P Topic -0.85 2.70 group member create join chat -9.09 0.00 weve easier making squashed feedback 0.00 2.70 fixed issue known startup way 0.00 3.85 fixed issue known startup way 0.00 2.70 search favourite deal nearby lo- 0.00 3.85 tab custom chrome galaxy spe- cal cific 0.85 -2.70 store apps detail play check 9.09 0.00 group member create join chat 0.85 -2.70 weve easier making squashed 9.09 0.00 search favourite deal nearby lo- feedback cal 1.69 -2.70 notification push receive alert 9.09 0.00 search result recent suggestion receiving radius Social

NI I Topic N P Topic -0.09 0.84 tournament player online win 0.00 1.51 live stream watch video broad- ranking cast 0.00 0.63 notification sound setting alert -0.94 0.38 version doubled displacement vibration app randomization 0.00 0.63 news feed article section read -0.94 0.38 language spanish french german portuguese 0.36 0.00 team match cheer vault dream 0.47 -0.75 round course shot golf hole 0.18 -0.21 fixed bug crash opening naviga- 0.47 -0.75 player stats single profile start- tiondrawer ing 0.18 -0.21 notification user showing event 1.88 0.00 alarm time mode timer sound loading Sports

NI I Topic N P Topic -0.17 0.66 bug minor fix single end -0.58 1.05 map location gps data zoom -0.33 0.44 card external storage support -1.16 0.35 import export file csv html file -0.08 0.66 app apps clean installed option 0.00 1.05 app apps clean installed option 0.17 -0.44 fix bug splash prevent process 0.58 -0.35 calculator tax rate calculation loan 0.08 -0.66 device android compatibility 1.16 0.00 language english spanish local- older newer ization french 0.75 -0.44 language swedish turkish trans- 1.16 0.00 data weight device setting body lation polish Tools

NI I Topic N P Topic 0.00 2.60 flash add widget version lan- -0.88 2.58 route map location search ad- guage dress 0.00 1.86 booking app taxi favourite ad- -0.88 1.94 map location gps data zoom dress -0.17 1.12 route map location search ad- -0.88 0.65 flight hotel booking app book dress 0.17 -0.37 removed permission unneces- 1.75 0.00 time hour format zone display sary needed unused 0.17 -0.37 device samsung htc nexus sup- 2.63 0.65 ticket purchase app send email port 1.02 0.00 bus stop journey station line 6.14 0.00 flash add widget version lan- guage Transport C.1. Supplementary Category-Specific Topic Results 229

NI I Topic N P Topic 0.00 3.54 start stop data gps directly -9.16 -0.61 better performance faster opti- misation enhancement -1.33 -0.37 value select tip graph altitude -2.96 0.61 map location gps data zoom -1.45 -0.75 screen selected current tapping -3.50 0.00 android app time improved displayed present 1.09 0.19 log available view device con- 4.31 0.61 trip travel destination time nected guide 0.97 0.00 booking app taxi favourite ad- 4.85 0.61 start stop data gps directly dress 0.85 -0.93 option optimization removed 5.12 0.61 app inside update mean just temporarily shadow Travel & Local

NI I Topic N P Topic -1.54 4.35 option menu hide using context -28.57 6.25 weather widget forecast temper- ature current -4.62 0.00 fix weather version display op- -14.29 6.25 language swedish turkish trans- tion lation polish 0.00 4.35 traffic information map live alert 0.00 6.25 plane game free pilot waiting 1.54 -4.35 added physical global turned 0.00 6.25 traffic information map live alert improved 1.54 -4.35 change minor api website visit 0.00 6.25 photo app child kid year 3.08 -4.35 weather widget forecast temper- 14.29 0.00 design material flat modern ature current guideline Weather C.1. Supplementary Category-Specific Topic Results 230

Table C.2: RQ1 + RQ2: Games category topics that changed at release time. NI: non-impact releases; I: impactful releases; N: releases that decreased rating; P: releases that increased rating. Top: proportion of topics that were added in I/P releases, and taken away or not added to NI/N releases. Bottom: proportion of topics that were removed or not changed in I/P releases, but were added in NI/N releases.

NI I Topic N P Topic -0.65 1.26 key keyboard input layout press -3.92 0.93 guild battle raid event hero -0.65 1.26 weapon new mission gun sniper -1.96 0.93 graphic control environment re- alistic driving -0.33 1.26 clan battle attack unit troop -1.96 0.93 new random item character tool 0.16 -1.26 new ton display call weekend 1.96 0.00 ad removed remove banner in- terstitial 0.33 -1.89 hero dungeon equipment skill 1.96 -0.93 alliance war battle unit attack item 1.14 -2.52 power level ups unlock time 1.96 -0.93 level pack designed newly fun Action

Adventure
NI      I       Topic
0.00    1.46    pack available free expansion includes
-0.45   0.73    challenge reward win daily earn
-0.45   0.73    problem issue updated help contact
0.23    -0.73   fix option improvement bug screen
2.50    0.73    adventure level island explore come
0.68    -2.19   play google service achievement leaderboards
N       P       Topic
0.00    2.47    playing thank description date minor
-3.57   -1.23   play google service achievement leaderboards
-1.79   0.00    day valentine love special heart
1.79    0.00    phone support model android send
1.79    0.00    day week month view calendar
1.79    0.00    problem issue updated help contact

Arcade
NI      I       Topic
-1.47   1.89    level candy latest sweet download
-0.49   1.89    review help really like nice
-0.49   1.89    new horse set stable need
0.49    -1.89   level difficulty adjusted adjustment adjust
0.49    -1.89   mode immersive night preview endless
0.49    -1.89   leaderboard achievement leaderboards game score
N       P       Topic
-9.09   0.00    leaderboard achievement leaderboards game score
-9.09   0.00    friend share facebook connect chip
-9.09   0.00    version upload delete folder advanced
0.00    2.38    add improve refine ons euro
9.09    -2.38   christmas holiday winter gift santa
9.09    -2.38   feedback thanks based coming thank

Board
NI      I       Topic
-1.81   0.00    related stability ad option use
-0.23   1.44    size reduced apk reduce smaller
-0.23   0.72    card box draw feature multiple
0.23    -0.72   new brand reaction prank dbz
0.45    -0.72   tournament player online win ranking
0.45    -0.72   ajout plus mail de pour
N       P       Topic
-4.35   0.00    device samsung htc nexus support
-2.17   2.15    leaderboard achievement leaderboards game score
-2.17   1.08    feature available update august timing
2.17    0.00    added device bacon restored showroom
4.35    0.00    puzzle pack update piece daily
4.35    0.00    support device better minor updated

Card
NI      I       Topic
0.00    2.19    nng thm khi hin tnh
0.00    1.46    policy google privacy term analytics
-0.65   0.73    related stability ad option use
0.22    -0.73   support android update hoptime basiskatalog
0.22    -0.73   link browser open page web
0.43    -1.46   multiplayer game mode online player
N       P       Topic
-1.67   1.30    leaderboard achievement leaderboards game score
-1.67   1.30    option added advertising modified disabling
-1.67   0.00    game played brain trivia organised
1.67    0.00    fix option improvement bug screen
3.33    1.30    nng thm khi hin tnh
3.33    0.00    card deck collection pack cake

Casino
NI      I       Topic
0.00    5.00    look feel fresh modern refreshed
0.00    5.00    character new unlock jump gender
0.00    5.00    stage new user item upgrade
1.22    0.00    run smoother make faster content
1.22    0.00    update thank issue reach enjoy
1.22    0.00    release note detail github stability
N       P       Topic
0.00    18.75   slot machine game win bonus
0.00    6.25    look feel fresh modern refreshed
0.00    6.25    character new unlock jump gender
0.00    6.25    new update car extreme road
0.00    6.25    event special reward limited time
0.00    6.25    save cloud game saved progress

Casual
NI      I       Topic
0.00    3.09    dragon ancient epic kingdom knight
0.00    2.06    fixed crash bug disappearing freezing
-0.51   1.03    new portal plant level zombie
1.53    0.00    baby fun accessory make clothes
4.09    2.06    run smoother make faster content
0.51    -2.06   dont forget review rate leave
N       P       Topic
0.00    4.84    dragon ancient epic kingdom knight
-2.86   1.61    language spanish french german portuguese
-2.86   0.00    display change menu direct nav
2.86    0.00    like review write help play
2.86    0.00    building build city quest decoration
2.86    0.00    circle look beauty photo eye

Educational
NI      I       Topic
-0.41   1.96    dont forget review rate leave
-0.21   1.31    language franais deutsch english meeting
0.00    1.31    touch control area little pad
1.65    0.65    size reduced apk reduce smaller
1.44    0.00    better performance faster optimisation enhancement
0.41    -1.31   sticker testing line bot chat
N       P       Topic
0.00    3.33    language franais deutsch english meeting
-1.08   1.67    game played brain trivia organised
-1.08   1.67    para melhorias com mais nova
4.30    1.67    free feel email contact app
1.08    -1.67   baby fun accessory make clothes
3.23    0.00    dont forget review rate leave

Music
NI      I       Topic
-1.05   3.08    song music piano lyric chord
-0.53   1.54    mode landscape survival renamed phone
0.00    1.54    brush color pen drawing paint
0.53    0.00    let know review store like
0.53    0.00    game played brain trivia organised
0.53    0.00    new sound sleep pony bugfixes
N       P       Topic
0.00    6.45    song music piano lyric chord
0.00    3.23    fixed bug major play bajiquan
0.00    3.23    friend share facebook connect chip
2.94    0.00    mode landscape survival renamed phone
2.94    0.00    new mod added map change
2.94    0.00    bug fixed optimization searching waiting

Puzzle
NI      I       Topic
-0.37   3.33    update thank issue reach enjoy
-0.74   1.67    chinese language simplified traditional japanese
-0.37   1.67    energy far time use thank
0.74    -1.67   character new unlock jump gender
1.12    -1.67   christmas holiday winter gift santa
1.12    -3.33   language spanish french german portuguese
N       P       Topic
-15.00  2.50    language spanish french german portuguese
0.00    5.00    adventure level island explore come
-5.00   0.00    free version paid ad limit
5.00    0.00    bug fixed minor improved hint
5.00    0.00    game gameplay balance like waiting
10.00   0.00    update thank issue reach enjoy

Racing
NI      I       Topic
-0.31   2.30    language spanish french german portuguese
-0.31   1.15    chinese language simplified traditional japanese
-0.31   1.15    tournament player online win ranking
0.63    0.00    leaderboard achievement leaderboards game score
0.31    -1.15   challenge reward win daily earn
1.88    0.00    car vehicle wheel race steering
N       P       Topic
0.00    2.78    language spanish french german portuguese
0.00    1.39    save cloud game saved progress
0.00    1.39    improvement performance stability ground extreme
0.00    1.39    car vehicle racing ford lamborghini
6.67    0.00    new update car extreme road
6.67    -1.39   car vehicle wheel race steering

Role Playing
NI      I       Topic
-1.11   2.52    reward arena class skill season
0.00    1.26    play google service achievement leaderboards
0.00    1.26    added physical global turned improved
0.32    -0.63   help hero email need igg
0.16    -1.26   challenge reward win daily earn
1.90    0.00    guild battle raid event hero
N       P       Topic
-4.65   0.86    coin free earn gold bonus
0.00    4.31    hero dungeon equipment skill item
-2.33   1.72    increased level max cap limit
2.33    -1.72   hero battle tower enemy skill
4.65    0.00    new enjoy bundle vehicle tank
4.65    0.00    added physical global turned improved

Simulation
NI      I       Topic
-2.54   -1.13   graphic control environment realistic driving
-0.32   0.56    hero dungeon equipment skill item
-0.32   0.56    mode immersive night preview endless
0.16    -0.56   level episode new pig pop
0.32    -0.56   scenario upgrade aplikace open knihoven
0.64    -0.56   adventure level island explore come
N       P       Topic
-4.62   -0.89   fixed improved crash updated performance
-3.08   0.00    graphic control environment realistic driving
-1.54   0.89    optimization performance memory various measure
1.54    0.00    tt text network speech based
1.54    -0.89   thanks playing scene weve game
3.08    0.00    weapon new mission gun sniper

Sports
NI      I       Topic
-0.09   0.84    tournament player online win ranking
0.00    0.63    notification sound setting alert vibration
0.00    0.63    news feed article section read
0.36    0.00    team match cheer vault dream
0.18    -0.21   fixed bug crash opening navigationdrawer
0.18    -0.21   notification user showing event loading
N       P       Topic
0.00    1.51    live stream watch video broadcast
-0.94   0.38    version doubled displacement app randomization
-0.94   0.38    language spanish french german portuguese
0.47    -0.75   round course shot golf hole
0.47    -0.75   player stats single profile starting
1.88    0.00    alarm time mode timer sound

Strategy
NI      I       Topic
-1.30   1.96    guild battle raid event hero
-0.87   1.96    reward arena class skill season
-0.43   1.96    chinese language simplified traditional japanese
0.43    -1.96   christmas holiday winter gift santa
0.43    -1.96   thanks exciting issue visit like
0.87    -1.96   hero dungeon equipment skill item
N       P       Topic
0.00    2.56    building build city quest decoration
0.00    2.56    language spanish french german portuguese
0.00    2.56    chinese language simplified traditional japanese
0.00    2.56    guild battle raid event hero
8.33    0.00    fixed bug uploading crashing nowvideo
8.33    0.00    reward arena class skill season

Trivia
NI      I       Topic
0.00    2.44    play google service achievement leaderboards
-0.70   1.22    friend invite facebook join send
-0.35   1.22    device crash specific friendly numerous
0.35    -1.22   weve easier making squashed feedback
0.35    -1.22   ad purchase remove app restore
1.05    -1.22   de le pour correction vous
N       P       Topic
0.00    3.70    play google service achievement leaderboards
-3.57   0.00    world new tour map travel
0.00    1.85    bug fixed minor improved hint
3.57    0.00    update october april september november
3.57    0.00    para versin la que mejoras
3.57    0.00    friend invite facebook join send

Word
NI      I       Topic
-0.24   1.34    updated library internal party component
-0.71   0.67    facebook twitter share sharing instagram
-0.71   0.67    friend invite facebook join send
0.48    -0.67   play google service achievement leaderboards
0.48    -0.67   coin free earn gold bonus
0.71    -0.67   game play conquer additional biggest
N       P       Topic
-1.61   2.30    friend invite facebook join send
-1.61   1.15    ad purchase remove app restore
0.00    2.30    word dictionary letter day offline
1.61    0.00    app time pebble display multiple
1.61    -1.15   language spanish french german portuguese
3.23    0.00    updated library internal party component

Appendix D

CIRA Usage Manual

D.1 Servlet Tool

The tool is available at http://www0.cs.ucl.ac.uk/staff/W.Martin/cira. Enter the control, target, release ID, confidence interval and graph fields, and the CIRA server will process the request. To perform a single experiment, check graph mode and enter a release ID. The target, control and confidence interval fields are required.

D.2 Tool Data Specifications

Target files contain one target per line, in the following format:

targetID(integer),priorStartWeek(counted from 1),releaseWeek(start of the posterior period),posteriorEndWeek,[comma-separated list of floats for the time series vector]
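For illustration, a minimal sketch of writing a target file in this format follows; the file name, target ID, week boundaries and weekly values are made-up examples rather than real app data.

    # Minimal sketch: writing one target line in the documented format.
    # The file name and all values below are made-up illustrations.
    target_id = 42            # targetID (integer)
    prior_start_week = 1      # priorStartWeek (weeks counted from 1)
    release_week = 9          # releaseWeek (start of the posterior period)
    posterior_end_week = 12   # posteriorEndWeek
    weekly_values = [4.1, 4.2, 4.0, 4.3, 4.2, 4.4, 4.3, 4.5, 4.6, 4.7, 4.6, 4.8]

    with open("targets.csv", "w") as f:
        fields = [target_id, prior_start_week, release_week, posterior_end_week] + weekly_values
        f.write(",".join(str(v) for v in fields) + "\n")  # one target per line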

Control files contain one line per week, in the following format:

[comma-separated list of floats: week 1 value for each control]
[comma-separated list of floats: week 2 value for each control]
and so on

Each control occupies the same position in every line (one column per control), and each week's values go on a new line.
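A corresponding minimal sketch for control files follows, again with a made-up file name and values; it writes one line per week, with one comma-separated value per control series.

    # Minimal sketch: writing a control file in the documented format.
    # Two hypothetical control series, six weeks each.
    controls = [
        [3.9, 4.0, 4.0, 4.1, 4.1, 4.0],   # control 1, weeks 1..6
        [4.5, 4.4, 4.6, 4.5, 4.6, 4.6],   # control 2, weeks 1..6
    ]

    with open("controls.csv", "w") as f:
        for week_values in zip(*controls):   # transpose: one row per week
            f.write(",".join(str(v) for v in week_values) + "\n")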

Appendix E

Developer Questionnaire

The following questionnaire was sent to developers who expressed an interest in receiving a report on their significant release.

Answering these questions will take only 12 keystrokes, including 6 return key presses. However, if you want to elaborate on your answer, please do feel free to do so in free text. We are very interested to hear your views on whether this research could be useful to you.

1. Did you find the time series report useful? 0 - no, 1 - yes
2. Would you be interested in further time series reports on your app(s)? 0 - no, 1 - yes
3. Do you agree that the release impacted your app's performance? 0 - no, 1 - yes
4. Are you aware of any external influences, such as advertising campaigns, that could have led to the observed effect? 0 - no, 1 - yes
5. Would you be interested in learning about potential contributing factors to significant releases? 0 - no, 1 - yes
6. Would you make any changes to your app releasing strategy based on these findings? 0 - no, 1 - yes