Enforcing Abstract Immutability

Total Page:16

File Type:pdf, Size:1020Kb

Enforcing Abstract Immutability Enforcing Abstract Immutability by Jonathan Eyolfson A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Electrical and Computer Engineering Waterloo, Ontario, Canada, 2018 © Jonathan Eyolfson 2018 Examining Committee Membership The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote. External Examiner Ana Milanova Associate Professor Rensselaer Polytechnic Institute Supervisor Patrick Lam Associate Professor University of Waterloo Internal Member Lin Tan Associate Professor University of Waterloo Internal Member Werner Dietl Assistant Professor University of Waterloo Internal-external Member Gregor Richards Assistant Professor University of Waterloo ii I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. iii Abstract Researchers have recently proposed a number of systems for expressing, verifying, and inferring immutability declarations. These systems are often rigid, and do not support “abstract immutability”. An abstractly immutable object is an object o which is immutable from the point of view of any external methods. The C++ programming language is not rigid—it allows developers to express intent by adding immutability declarations to methods. Abstract immutability allows for performance improvements such as caching, even in the presence of writes to object fields. This dissertation presents a system to enforce abstract immutability. First, we explore abstract immutability in real-world systems. We found that developers often incorrectly use abstract immutability, perhaps because no programming language helps developers correctly implement abstract immutability. We believe that this omission leads to incorrect usages. Specifically, we wrote a dynamic analysis that reports any writes through immutability declarations. To our knowledge, this work was the first to explore how objects implement abstract immutability (or fail to implement it). Our novel study found three uses of abstract immutability: caching, delayed initialization, and unit testing. Unit testing was a surprising application of abstract immutability, and we believe that the ability to modify state is needed for proper unit testing. Next, we explore developers’ revealed needs for immutability in the source code. We found that the majority of classes contain a mix of immutable and mutable methods, with a majority of the overall methods being immutable. Immutability systems with only im- mutable or all-mutating classes are insufficient: developers need immutability declarations at method granularity. Our study then combined developer immutability declarations with results from a static analysis to estimate the true number of immutable methods. The static analysis checked that no transitive writes to a receiver object occurred. Our results indicated the need for a sophisticated analysis to check that these apparently abstractly immutable methods were indeed abstractly immutable. Finally, we created a novel static analysis which checks that developers follow abstract immutability. Specifically, we define abstract immutability to mean that a class’s setof immutable methods is collectively free of writes to exposed fields. Our analysis found incorrect usages of abstract immutability, such as incorrect caching. This analysis is par- ticularly valuable in the context of code evolution, whereby subsequent programmers may make changes that break previously-correct cache implementations, for instance. Our work allows developers to trust that their code is abstractly immutable. iv Acknowledgements First, I need to thank my parents. Especially my dad, who has supported me to follow my passion no matter what. Whether I’m at my highest highs or lowest lows, he has guided me through them all. Thank you Grandma Cook, you’ve always ensured I was taken care of and I couldn’t have done it without you. I would also like to thank Tracy for all her love and support, and putting up with horrible puns. I would be remiss if I didn’t thank my teachers at Frontenac High School: Mr. Kotecha, Mr. Wills, Mr. Lee, and Mr. Grew. I would not have found my passion if it wasn’t for them opening my eyes, and setting an excellent example. They encouraged the pursuit of knowledge, and curiosity I didn’t know I had. I owe a deep gratitude to my supervisor Patrick Lam, who has been guiding me ever since we first met at Curry Original in Kingston. Without his knowledge and expertiseI would not have completed my masters, much less a doctorate. He has been patient and understanding through my graduate studies, and always allowed me to explore problems that interested me. Thank you to my committee members. I appreciate the time you’ve taken out of your busy schedules to add your valuable insight to this work. Finally, I would like to thank friends that I’ve been able to bounce ideas off of during my research. There are too many to name, you know who you are, thank you. For the purpose of this thesis in particular: Jean-Christophe Petkovich, you’ve been invaluable, as both a set of open ears and a friend dating back from grade 3. v Dedication In memory of Sisters Kathryn and Mary Joan LaFleur. vi Table of Contents List of Figures x Listings xii List of Tables xv 1 Introduction 1 1.1 Motivation .................................... 2 1.2 Case Study: Clang ............................... 4 1.3 Exploratory Study ............................... 4 1.4 Outline ..................................... 6 2 Background 7 2.1 Immutability Terminology ........................... 7 2.1.1 Transitive and Non-Transitive ..................... 7 2.1.2 Concrete and Abstract ......................... 8 2.1.3 Object and Reference Immutability .................. 8 2.1.4 Relation to Purity ........................... 9 2.2 Immutability in C++ .............................. 9 2.2.1 const keyword ............................. 10 2.2.2 mutable keyword ............................ 11 2.2.3 Bitwise const .............................. 12 2.2.4 Logically const ............................. 12 2.2.5 Additional Loopholes .......................... 12 2.3 Immutability in Java .............................. 15 2.3.1 final Keyword ............................. 17 2.3.2 Javari .................................. 17 2.4 Static Analysis ................................. 17 2.4.1 Dataflow Analysis ........................... 17 2.4.2 Pointer Analysis ............................ 18 3 Dynamic Observations of Writes-Through-Immutability 19 3.1 Motivation .................................... 20 3.2 Technique .................................... 23 3.3 Classification .................................. 29 vii 3.4 Results ...................................... 32 3.4.1 Protobuf ................................. 33 3.4.2 LevelDB ................................. 35 3.4.3 fish shell ................................. 38 3.4.4 Mosh (mobile shell) .......................... 38 3.4.5 LLVM TableGen ............................ 39 3.4.6 Tesseract ................................ 40 3.4.7 Ninja .................................. 40 3.4.8 Weston ................................. 40 3.4.9 Summary ................................ 42 4 Static Observations of Immutability 43 4.1 Motivation .................................... 44 4.2 Technique .................................... 45 4.3 Results ...................................... 48 4.3.1 LLVM .................................. 50 4.3.2 OpenCV ................................. 51 4.3.3 Protobuf ................................. 53 4.3.4 fish shell ................................. 54 4.3.5 libsequence ............................... 56 4.3.6 Mosh .................................. 57 4.3.7 Ninja .................................. 58 4.3.8 Summary ................................ 60 5 Abstract Immutability Analysis 63 5.1 Motivating Example .............................. 64 5.2 Technique .................................... 68 5.2.1 Formalization .............................. 69 5.2.2 Analysis Operation ........................... 71 5.2.3 Abstraction ............................... 71 5.2.4 Running Example ........................... 74 5.2.5 Transfer Functions ........................... 77 5.2.6 Memory Alias Sets ........................... 79 5.2.7 Merge Operation ............................ 83 5.2.8 Assumptions .............................. 86 5.3 Results ...................................... 87 5.3.1 libsequence ............................... 87 5.3.2 fish shell ................................. 90 5.3.3 Mosh .................................. 92 5.3.4 Ninja .................................. 92 5.3.5 Summary ................................ 93 viii 6 Related Work 95 6.1 Type Systems .................................. 96 6.1.1 Javari .................................. 96 6.1.2 ReIm .................................. 97 6.1.3 Glacier ................................. 98 6.2 Type Inference ................................. 98 6.2.1 Javarifier ................................ 99 6.2.2 ReImInfer ................................ 99 6.2.3 Inferring const for C programs .................... 100 6.3 Language Features ............................... 100 6.3.1 Usability of Language Features .................... 100 6.3.2 Frozen Objects ............................. 101 6.3.3
Recommended publications
  • What If What I Need Is Not in Powerai (Yet)? What You Need to Know to Build from Scratch?
    IBM Systems What if what I need is not in PowerAI (yet)? What you need to know to build from scratch? Jean-Armand Broyelle June 2018 IBM Systems – Cognitive Era Things to consider when you have to rebuild a framework © 2017 International Business Machines Corporation 2 IBM Systems – Cognitive Era CUDA Downloads © 2017 International Business Machines Corporation 3 IBM Systems – Cognitive Era CUDA 8 – under Legacy Releases © 2017 International Business Machines Corporation 4 IBM Systems – Cognitive Era CUDA 8 Install Steps © 2017 International Business Machines Corporation 5 IBM Systems – Cognitive Era cuDNN and NVIDIA drivers © 2017 International Business Machines Corporation 6 IBM Systems – Cognitive Era cuDNN v6.0 for CUDA 8.0 © 2017 International Business Machines Corporation 7 IBM Systems – Cognitive Era cuDNN and NVIDIA drivers © 2017 International Business Machines Corporation 8 IBM Systems – Cognitive Era © 2017 International Business Machines Corporation 9 IBM Systems – Cognitive Era © 2017 International Business Machines Corporation 10 IBM Systems – Cognitive Era cuDNN and NVIDIA drivers © 2017 International Business Machines Corporation 11 IBM Systems – Cognitive Era Prepare your environment • When something goes wrong it’s better to Remove local anaconda installation $ cd ~; rm –rf anaconda2 .conda • Reinstall anaconda $ cd /tmp; wget https://repo.anaconda.com/archive/Anaconda2-5.1.0-Linux- ppc64le.sh $ bash /tmp/Anaconda2-5.1.0-Linux-ppc64le.sh • Activate PowerAI $ source /opt/DL/tensorflow/bin/tensorflow-activate • When you
    [Show full text]
  • Scala Tutorial
    Scala Tutorial SCALA TUTORIAL Simply Easy Learning by tutorialspoint.com tutorialspoint.com i ABOUT THE TUTORIAL Scala Tutorial Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. Scala has been created by Martin Odersky and he released the first version in 2003. Scala smoothly integrates features of object-oriented and functional languages. This tutorial gives a great understanding on Scala. Audience This tutorial has been prepared for the beginners to help them understand programming Language Scala in simple and easy steps. After completing this tutorial, you will find yourself at a moderate level of expertise in using Scala from where you can take yourself to next levels. Prerequisites Scala Programming is based on Java, so if you are aware of Java syntax, then it's pretty easy to learn Scala. Further if you do not have expertise in Java but you know any other programming language like C, C++ or Python, then it will also help in grasping Scala concepts very quickly. Copyright & Disclaimer Notice All the content and graphics on this tutorial are the property of tutorialspoint.com. Any content from tutorialspoint.com or this tutorial may not be redistributed or reproduced in any way, shape, or form without the written permission of tutorialspoint.com. Failure to do so is a violation of copyright laws. This tutorial may contain inaccuracies or errors and tutorialspoint provides no guarantee regarding the accuracy of the site or its contents including this tutorial. If you discover that the tutorialspoint.com site or this tutorial content contains some errors, please contact us at [email protected] TUTORIALS POINT Simply Easy Learning Table of Content Scala Tutorial ..........................................................................
    [Show full text]
  • Open Source Copyrights
    Kuri App - Open Source Copyrights: 001_talker_listener-master_2015-03-02 ===================================== Source Code can be found at: https://github.com/awesomebytes/python_profiling_tutorial_with_ros 001_talker_listener-master_2016-03-22 ===================================== Source Code can be found at: https://github.com/ashfaqfarooqui/ROSTutorials acl_2.2.52-1_amd64.deb ====================== Licensed under GPL 2.0 License terms can be found at: http://savannah.nongnu.org/projects/acl/ acl_2.2.52-1_i386.deb ===================== Licensed under LGPL 2.1 License terms can be found at: http://metadata.ftp- master.debian.org/changelogs/main/a/acl/acl_2.2.51-8_copyright actionlib-1.11.2 ================ Licensed under BSD Source Code can be found at: https://github.com/ros/actionlib License terms can be found at: http://wiki.ros.org/actionlib actionlib-common-1.5.4 ====================== Licensed under BSD Source Code can be found at: https://github.com/ros-windows/actionlib License terms can be found at: http://wiki.ros.org/actionlib adduser_3.113+nmu3ubuntu3_all.deb ================================= Licensed under GPL 2.0 License terms can be found at: http://mirrors.kernel.org/ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all. deb alsa-base_1.0.25+dfsg-0ubuntu4_all.deb ====================================== Licensed under GPL 2.0 License terms can be found at: http://mirrors.kernel.org/ubuntu/pool/main/a/alsa- driver/alsa-base_1.0.25+dfsg-0ubuntu4_all.deb alsa-utils_1.0.27.2-1ubuntu2_amd64.deb ======================================
    [Show full text]
  • 1 Aliasing and Immutability
    Vocabulary: Accessors & Mutators Computer Science and Engineering The Ohio State University Accessor: A method that reads, but never changes, the Aliasing and Immutability (abstract) state of an object Computer Science and Engineering College of Engineering The Ohio State University Concrete representation may change, so long as change is not visible to client eg Lazy initialization Examples: getter methods, toString Lecture 8 Formally: restores “this” Mutator method: A method that may change the (abstract) state of an object Examples: setter methods Formally: updates “this” Constructors not considered mutators A Fixed Epoch Interface A Broken Time Period Class Computer Science and Engineering The Ohio State University Computer Science and Engineering The Ohio State University // Interface cover story goes here public class Period implements FixedEpoch { // Mathematical modeling … private Date start; // constructor private Date end; // requires start date < end date // initializes start and end dates // operations have specifications based on the model public Period(Date start, Date end) { // exercises … this.start = start; this.e nd = e nd; public interface FixedEpoch { } public Date getStart(); public Date getEnd(); } public Date getStart() { return start; } public Date getEnd() { return end; } } Problem: Aliasing A Better Period Class Computer Science and Engineering The Ohio State University Computer Science and Engineering The Ohio State University Assignment in constructor creates an alias public class Period
    [Show full text]
  • Functional Programming Inside OOP?
    Functional Programming inside OOP? It’s possible with Python >>>whoami() Carlos Villavicencio ● Ecuadorian θ ● Currently: Python & TypeScript ● Community leader ● Martial arts: 剣道、居合道 ● Nature photography enthusiast po5i Cayambe Volcano, 2021. >>>why_functional_programming ● Easier and efficient ● Divide and conquer ● Ease debugging ● Makes code simpler and readable ● Also easier to test >>>history() ● Functions were first-class objects from design. ● Users wanted more functional solutions. ● 1994: map, filter, reduce and lambdas were included. ● In Python 2.2, lambdas have access to the outer scope. “Not having the choice streamlines the thought process.” - Guido van Rossum. The fate of reduce() in Python 3000 https://python-history.blogspot.com/2009/04/origins-of-pythons-functional-features.html >>>has_django_fp() https://github.com/django/django/blob/46786b4193e04d398532bbfc3dcf63c03c1793cb/django/forms/formsets.py#L201-L213 https://github.com/django/django/blob/ca9872905559026af82000e46cde6f7dedc897b6/django/forms/formsets.py#L316-L328 Immutability An immutable object is an object whose state cannot be modified after it is created. Booleans, strings, and integers are immutable objects. List and dictionaries are mutable objects. Thread safety >>>immutability def update_list(value: list) -> None: def update_number(value: int) -> None: value += [10] value += 10 >>> foo = [1, 2, 3] >>> foo = 10 >>> id(foo) >>> update_number(foo) 4479599424 >>> foo >>> update_list(foo) 10 >>> foo 樂 [1, 2, 3, 10] >>> id(foo) 4479599424 >>>immutability def update_number(value: int) -> None: print(value, id(value)) value += 10 print(value, id(value)) >>> foo = 10 >>> update_number(foo) 10 4478220880 ڃ 4478221200 20 >>> foo 10 https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747 Decorators They are functions which modify the functionality of other functions. Higher order functions.
    [Show full text]
  • Mochi-JCST-01-20.Pdf
    Ross R, Amvrosiadis G, Carns P et al. Mochi: Composing data services for high-performance computing environments. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 35(1): 121–144 Jan. 2020. DOI 10.1007/s11390-020-9802-0 Mochi: Composing Data Services for High-Performance Computing Environments Robert B. Ross1, George Amvrosiadis2, Philip Carns1, Charles D. Cranor2, Matthieu Dorier1, Kevin Harms1 Greg Ganger2, Garth Gibson3, Samuel K. Gutierrez4, Robert Latham1, Bob Robey4, Dana Robinson5 Bradley Settlemyer4, Galen Shipman4, Shane Snyder1, Jerome Soumagne5, and Qing Zheng2 1Argonne National Laboratory, Lemont, IL 60439, U.S.A. 2Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A. 3Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada 4Los Alamos National Laboratory, Los Alamos NM, U.S.A. 5The HDF Group, Champaign IL, U.S.A. E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected] E-mail: [email protected]; [email protected]; [email protected]; [email protected] E-mail: [email protected]; [email protected]; [email protected]; {bws, gshipman}@lanl.gov E-mail: [email protected]; [email protected]; [email protected] Received July 1, 2019; revised November 2, 2019. Abstract Technology enhancements and the growing breadth of application workflows running on high-performance computing (HPC) platforms drive the development of new data services that provide high performance on these new platforms, provide capable and productive interfaces and abstractions for a variety of applications, and are readily adapted when new technologies are deployed. The Mochi framework enables composition of specialized distributed data services from a collection of connectable modules and subservices.
    [Show full text]
  • Master Thesis
    Master thesis To obtain a Master of Science Degree in Informatics and Communication Systems from the Merseburg University of Applied Sciences Subject: Tunisian truck license plate recognition using an Android Application based on Machine Learning as a detection tool Author: Supervisor: Achraf Boussaada Prof.Dr.-Ing. Rüdiger Klein Matr.-Nr.: 23542 Prof.Dr. Uwe Schröter Table of contents Chapter 1: Introduction ................................................................................................................................. 1 1.1 General Introduction: ................................................................................................................................... 1 1.2 Problem formulation: ................................................................................................................................... 1 1.3 Objective of Study: ........................................................................................................................................ 4 Chapter 2: Analysis ........................................................................................................................................ 4 2.1 Methodological approaches: ........................................................................................................................ 4 2.1.1 Actual approach: ................................................................................................................................... 4 2.1.2 Image Processing with OCR: ................................................................................................................
    [Show full text]
  • An Accuracy Examination of OCR Tools
    International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8, Issue-9S4, July 2019 An Accuracy Examination of OCR Tools Jayesh Majumdar, Richa Gupta texts, pen computing, developing technologies for assisting Abstract—In this research paper, the authors have aimed to do a the visually impaired, making electronic images searchable comparative study of optical character recognition using of hard copies, defeating or evaluating the robustness of different open source OCR tools. Optical character recognition CAPTCHA. (OCR) method has been used in extracting the text from images. OCR has various applications which include extracting text from any document or image or involves just for reading and processing the text available in digital form. The accuracy of OCR can be dependent on text segmentation and pre-processing algorithms. Sometimes it is difficult to retrieve text from the image because of different size, style, orientation, a complex background of image etc. From vehicle number plate the authors tried to extract vehicle number by using various OCR tools like Tesseract, GOCR, Ocrad and Tensor flow. The authors in this research paper have tried to diagnose the best possible method for optical character recognition and have provided with a comparative analysis of their accuracy. Keywords— OCR tools; Orcad; GOCR; Tensorflow; Tesseract; I. INTRODUCTION Optical character recognition is a method with which text in images of handwritten documents, scripts, passport documents, invoices, vehicle number plate, bank statements, Fig.1: Functioning of OCR [2] computerized receipts, business cards, mail, printouts of static-data, any appropriate documentation or any II. OCR PROCDURE AND PROCESSING computerized receipts, business cards, mail, printouts of To improve the probability of successful processing of an static-data, any appropriate documentation or any picture image, the input image is often ‘pre-processed’; it may be with text in it gets processed and the text in the picture is de-skewed or despeckled.
    [Show full text]
  • Exploring Language Support for Immutability
    Exploring Language Support for Immutability Michael Coblenz∗, Joshua Sunshine∗, Jonathan Aldrich∗, Brad Myers∗, Sam Webery, Forrest Shully May 8, 2016 CMU-ISR-16-106 This is an extended version of a paper that was published at ICSE [11]. Institute for Software Research School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 ∗School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA ySoftware Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, USA This material is supported in part by NSA lablet contract #H98230-14-C-0140, by NSF grant CNS-1423054, and by Contract No. FA8721-05-C-0003 with CMU for the operation of the SEI, a federally funded research and development center sponsored by the US DoD. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of any of the sponsors. Keywords: Programming language design, Programming language usability, Immutability, Mutability, Program- mer productivity, Empirical studies of programmers Abstract Programming languages can restrict state change by preventing it entirely (immutability) or by restricting which clients may modify state (read-only restrictions). The benefits of immutability and read-only restrictions in software structures have been long-argued by practicing software engineers, researchers, and programming language designers. However, there are many proposals for language mechanisms for restricting state change, with a remarkable diversity of tech- niques and goals, and there is little empirical data regarding what practicing software engineers want in their tools and what would benefit them. We systematized the large collection of techniques used by programming languages to help programmers prevent undesired changes in state.
    [Show full text]
  • CSI: Inferring Mobile ABR Video Adaptation Behavior Under HTTPS and QUIC
    CSI: Inferring Mobile ABR Video Adaptation Behavior under HTTPS and QUIC Shichang Xu Subhabrata Sen Z. Morley Mao University of Michigan AT&T Labs – Research University of Michigan Abstract Server Manifest Network Client Mobile video streaming services have widely adopted Adap- Chunks HTTP tive Bitrate (ABR) streaming to dynamically adapt the stream- Track ing quality to variable network conditions. A wide range of 720p 1 Buffer third-party entities such as network providers and testing 480p IP packets services need to understand such adaptation behavior for 360p 1 2 3 Index purposes such as QoE monitoring and network management. CSI The traditional approach involved conducting test runs and analyzing the HTTP-level information from the associated network traffic to understand the adaptation behavior under Figure 1. ABR streaming overview different network conditions. However, end-to-end traffic encryption protocols such as HTTPS and QUIC are being increasingly used by streaming services, hindering such tra- Rate (ABR) streaming (predominantly HLS [75] and DASH [31]) ditional traffic analysis approaches. has been widely adopted in industry for delivering satisfac- To address this, we develop CSI (Chunk Sequence Infer- tory Quality of Experience (QoE) over dynamic cellular net- encer), a general system that enables third-parties to conduct work conditions. The server encodes each video into multiple active measurements and infer mobile ABR video adapta- versions with different picture quality levels and encoding tion behavior based on packet size and timing information bitrates (with higher bitrates for higher-quality encodings) still available in the encrypted traffic. We perform exten- called tracks, and splits each track into shorter chunks, each sive evaluations and demonstrate that CSI achieves high representing a few seconds worth of playback content (Fig- inference accuracy for video encodings of popular streaming ure 1).
    [Show full text]
  • 1 Immutable Classes
    A Broken Time Period Class Computer Science and Engineering The Ohio State University public class Period { private Date start; Immutable Classes private Date end; Computer Science and Engineering College of Engineering The Ohio State University public Period(Date start, Date end) { assert (start.compareTo(end) > 0); //start < end this.start = start; this.end = end; Lecture 8 } public Date getStart() { return start; } public Date getEnd() { return end; } } Questions Problem: Aliasing Computer Science and Engineering The Ohio State University Computer Science and Engineering The Ohio State University What is an invariant in general? Assignment in constructor creates an alias Ans: Client and component both have references to the same Date object Class invariant can be undermined via alias What is an invariant for class Period? Date start = new Date(300); Ans: Date end = new Date (500); Period p = new Period (start, end); end.setTime(100); //modifies internals of p Why is this an invariant for this class? Solution: “defensive copying” Ans: Constructor creates a copy of the arguments Copy is used to initialize the private fields Metaphor: ownership A Better Period Class Good Practice: Copy First Computer Science and Engineering The Ohio State University Computer Science and Engineering The Ohio State University public class Period { private Date start; When making a defensive copy of private Date end; constructor arguments: public Period(Date start, Date end) { First copy the arguments assert (start.compareTo(end) >
    [Show full text]
  • A Dataset for Github Repository Deduplication
    A Dataset for GitHub Repository Deduplication Diomidis Spinellis Audris Mockus Zoe Kotti [email protected] {dds,zoekotti}@aueb.gr University of Tennessee Athens University of Economics and Business ABSTRACT select distinct p1, p2 from( select project_commits.project_id as p2, GitHub projects can be easily replicated through the site’s fork first_value(project_commits.project_id) over( process or through a Git clone-push sequence. This is a problem for partition by commit_id empirical software engineering, because it can lead to skewed re- order by mean_metric desc) as p1 sults or mistrained machine learning models. We provide a dataset from project_commits of 10.6 million GitHub projects that are copies of others, and link inner join forkproj.all_project_mean_metric each record with the project’s ultimate parent. The ultimate par- on all_project_mean_metric.project_id = ents were derived from a ranking along six metrics. The related project_commits.project_id) as shared_commits projects were calculated as the connected components of an 18.2 where p1 != p2; million node and 12 million edge denoised graph created by direct- Listing 1: Identification of projects with common commits ing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched GitHub contains many millions of copied projects. This is a prob- clumping projects. Projects that introduced unwanted clumping lem for empirical software engineering. First, when data contain- were identified by repeatedly visualizing shortest path distances ing multiple copies of a repository are analyzed, the results can between unrelated important projects. Our dataset identified 30 end up skewed [27]. Second, when such data are used to train thousand duplicate projects in an existing popular reference dataset machine learning models, the corresponding models can behave of 1.8 million projects.
    [Show full text]