Scaling Collaborative Open Data Science
Scaling Collaborative Open Data Science by Micah J. Smith B.A., Columbia University (2014) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2018 © Massachusetts Institute of Technology 2018. All rights reserved. Author................................................................ Department of Electrical Engineering and Computer Science May 23, 2018 Certified by. Kalyan Veeramachaneni Principal Research Scientist Laboratory for Information and Decision Systems Thesis Supervisor Accepted by . Leslie A. Kolodziejski Professor of Electrical Engineering and Computer Science Chair, Department Committee on Graduate Students 2 Scaling Collaborative Open Data Science by Micah J. Smith Submitted to the Department of Electrical Engineering and Computer Science on May 23, 2018, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science Abstract Large-scale, collaborative, open data science projects have the potential to address important societal problems using the tools of predictive machine learning. However, no suitable framework exists to develop such projects collaboratively and openly, at scale. In this thesis, I discuss the deficiencies of current approaches and then develop new approaches for this problem through systems, algorithms, and interfaces. A cen- tral theme is the restructuring of data science projects into scalable, fundamental units of contribution. I focus on feature engineering, structuring contributions as the creation of independent units of feature function source code. This then facilitates the integration of many submissions by diverse collaborators into a single, unified, machine learning model, where contributions can be rigorously validated and verified to ensure reproducibility and trustworthiness.
[Show full text]