A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

Andrew J. Simmons Scott Barnett Jessica Rivera-Villicana Deakin University Deakin University Deakin University Applied Artificial Intelligence Inst. Applied Artificial Intelligence Inst. Applied Artificial Intelligence Inst. Geelong, VIC, Australia Geelong, VIC, Australia Geelong, VIC, Australia [email protected] [email protected] [email protected] Akshat Bajaj Rajesh Vasa Deakin University Deakin University Applied Artificial Intelligence Inst. Applied Artificial Intelligence Inst. Geelong, VIC, Australia Geelong, VIC, Australia [email protected] [email protected]

ABSTRACT ACM Reference Format: Background: Meeting the growing industry demand for Data Sci- Andrew J. Simmons, Scott Barnett, Jessica Rivera-Villicana, Akshat Bajaj, ence requires cross-disciplinary teams that can translate machine and Rajesh Vasa. 2020. A large-scale comparative analysis of Coding Stan- dard conformance in Open-Source Data Science projects. In ESEM ’20: ACM learning research into production-ready code. Software engineer- / IEEE International Symposium on Empirical and Mea- ing teams value adherence to coding standards as an indication surement (ESEM) (ESEM ’20), October 8–9, 2020, Bari, Italy. ACM, New York, of code readability, maintainability, and developer expertise. How- NY, USA, 11 pages. https://doi.org/10.1145/3382494.3410680 ever, there are no large-scale empirical studies of coding standards focused specifically on Data Science projects. Aims: This study in- 1 INTRODUCTION vestigates the extent to which Data Science projects follow code standards. In particular, which standards are followed, which are Software that conforms to coding standards creates a shared under- ignored, and how does this differ to traditional software projects? standing between developers [14], and improves maintainability Method: We compare a corpus of 1048 Open-Source Data Science [8, 19]. Consistent coding standards also enhances the readability projects to a reference group of 1099 non-Data Science projects of the code by defining where to declare instance variables; how with a similar level of quality and maturity. Results: Data Science to name classes, methods, and variables; how to structure the code projects suffer from a significantly higher rate of functions that and other similar guidelines [17]. use an excessive numbers of parameters and local variables. Data A growing trend in the software industry is the inclusion of Data Science projects also follow different variable naming conventions Scientists on software engineering teams to build predictive models, to non-Data Science projects. Conclusions: The differences indicate analyze product and customer data, present insights to business that Data Science codebases are distinct from traditional software and to build data engineering pipelines [13]. Data Scientists may codebases and do not follow traditional software engineering con- not follow common coding standards as they often come from non- ventions. Our conjecture is that this may be because traditional software engineering backgrounds where software engineering software engineering conventions are inappropriate in the context best practices are unknown. Furthermore, a Data Scientist focuses of Data Science projects. on producing insights from data and building predictive models rather than on the inherent quality attributes of the code. In this CCS CONCEPTS context of a mixed team of Data Scientists and Software Engineers, team members must collaborate to produce a coherent software arXiv:2007.08978v2 [cs.SE] 28 Jul 2020 • Software and its engineering → Software libraries and repos- artefact and a consistent coding standard is essential for cohesive itories. collaboration and to maintain productivity. However, currently it KEYWORDS is unknown if Data Scientists follow common coding standards.