Language-independent volume measurement

Edwin Ouwehand [email protected]

Summer 2018, 52 pages

Supervisor: Ana Oprescu
Organisation supervisor: Lodewijk Bergmans
Second reader: Ana Varbanescu

Host organisation: Software Improvement Group, http://www.sig.eu

Universiteit van Amsterdam, Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering, http://www.software-engineering-amsterdam.nl

Contents

Abstract

1 Introduction
  1.1 Problem statement
  1.2 Research Questions
  1.3 Software Improvement Group
  1.4 Outline

2 Background
  2.1 Software Sizing
    2.1.1 Function Points
    2.1.2 Effort estimation
  2.2 Expressiveness of programming languages
  2.3 Kolmogorov Complexity
    2.3.1 Incomputability and Estimation
    2.3.2 Applications
  2.4 Data compression
    2.4.1 Compression Ratio
    2.4.2 Lossless & Lossy
    2.4.3 Archives

3 Research Method
  3.1 Methodology
  3.2 Data
  3.3 Counting Lines of Code
  3.4 The expressiveness spectrum

4 Measuring information content
  4.1 Compressor and Algorithm selection
    4.1.1 Comparing algorithms
  4.2 Archive selection
    4.2.1 Overhead
    4.2.2 Comparing archives
  4.3 Project size
  4.4 Discussion
  4.5 Conclusion

5 Expressiveness and information content
  5.1 Determining language expressiveness levels
  5.2 Validation
  5.3 Normalising LOC counts
  5.4 Discussion
  5.5 Conclusion

6 Quality and relative verbosity
  6.1 Experiment
  6.2 Discussion
  6.3 Conclusion

7 Related Work
  7.1 Normalised Compression Distance
  7.2 Calculating software productivity with compression
  7.3 Determining software complexity

8 Conclusion
  8.1 Future work

Acknowledgements
Bibliography
Appendices
A Cut-off Data
B Language Distributions

Abstract

The software size of a system is typically measured in lines of code. Lines of code are easy to obtain, but the number of lines needed to implement a given functionality is strongly influenced by the programming style and programming languages used. A new approach is to measure the 'information content' of a system as an estimate of its size. With this approach we have successfully determined the expressiveness of various programming languages; however, we were not able to verify the results definitively. As a practical application, we have proposed a new way to normalise lines of code counts. Finally, we found no relation between various quality metrics and a verbose style.

Chapter 1

Introduction

Determining the size of a software system is done for various reasons. It is typically used to predict the amount of effort that will be required to develop a system, as well as to estimate programming productivity once the system has been developed. As explained by Galorath [GE06], estimates are only as good as the size projections they are based on. In the physical world, size is a measure of volume or mass. In the software world, though, size is not as clearly defined. Some metrics include counting characters, tokens, lines of code (LOC), classes and function points.

A well-established way to determine the size of a project is by counting the lines of code that have been produced. Studies suggest [HPM03] that LOC often correlates with other measures of effort or functionality, such as function points. The language of choice and the style of the programmer play a large role in the relation between the size in LOC and the actual effort required to create the system. As a result, for systems that are written in different languages, the number of lines is not comparable. Nevertheless, the number of lines of code is generally accepted as a sensible and practical measure of the size of a system, because it can be accurately measured, fully automated and is easily comprehensible. The number of lines required to express a certain amount of functionality in a language is an inherent property of said language and is typically referred to as the language gearing factor, language level or expressiveness.

In this study, we are interested in determining a size measure that helps us to compare source code with regard to creation effort, namely the intellectual effort that goes into writing a number of lines of source code. We believe that the relevant size of a software system is proportional to the amount of information that is encapsulated in the code base. This follows the idea that the comprehension of program functionality is the comprehension of information and relationships [BH91], which in turn implies that the development process consists of translating knowledge about processes, activities, data structures and manipulations, and calculations into source code. Certain programming languages allow the programmer to express this in a very concise manner, where others do not. We argue that the bulk of the effort in developing software is spent thinking and reasoning about these concepts, and only a small portion is spent translating the result into source code. Therefore, our approach is based on the idea that the information content of a system is a reflection of its functional size.

By applying this way of measuring the size of source code, we can derive a table containing expressiveness levels of various languages, a language level table. This would allow future LOC measurements of projects to be normalised, which we expect to be a better indication of the intellectual effort that went into creating a project than traditional methods.

Lastly, we consider the qualitative aspect of unnecessarily verbose or duplicated code (relatively high line count and low functionality). Inexperienced developers often resort to code duplication, which is more bug-prone and costly to maintain, but results in a higher LOC count. Therefore, we also investigate the relation between a system's verbosity relative to systems in the same language and its other qualitative attributes.

1.1 Problem statement

Consider two applications that provide the exact same functionality (screens, reports, databases). One of the applications is written in Java and the other is written in Python. The number of lines required for the Java implementation is expected to be higher, because Java is a more verbose language. We can observe this effect even at the smallest level. For example, in listings 1.1 and 1.2 we see 'Hello World' coded in Java (5 LOC) and Python (2 LOC) respectively. Similarly, the LOC counts of two functionally identical Java applications could be vastly different based on code conventions and stylistic differences of the programmer. An experienced developer may be able to develop the same functionality with far less code.

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}

Listing 1.1: Hello World coded in Java

#!/usr/bin/env python
print "Hello, world!"

Listing 1.2: Hello World coded in Python

These limitations have led to the inception of backfiring, which is the conversion of lines of code to function points [GE06, Jon95] based on historical data. Examples of this include the SPR Programming Languages Table (https://www.spr.com/) and the QSM Function Points Languages Table (http://www.qsm.com/resources/function-point-languages-table), which describe backfiring ratios for various languages. With these ratios, LOC counts can essentially be normalised for the language used. Sadly, the benchmarking process for these tables is far from ideal. For one, they are based on function points, which in turn are counted from documentation rather than from the actual system. Though the counting process is standardised [Sse12] and can be automated, there often exists a mismatch between the actual software and its documentation. Secondly, software is often developed in more than one language; a variety of languages may be employed depending on the complexity and requirements. This means that functionality cannot directly be attributed to a particular language, as it is interwoven throughout the system.

We have reason to believe that these tables are indeed (to an extent) flawed. We can observe large differences for the same language across different tables, as well as differences between versions of the same language within the same table. Furthermore, the measurements are often inconsistent among these tables, for example disagreeing on whether JavaScript or COBOL is more verbose.

Lastly, it is important to consider how measurements affect developer behaviour when used as a tool for effort or productivity estimation. A programmer whose productivity is measured in lines of code has an incentive to write unnecessarily verbose code. This is undesirable, as developers would be inclined to produce as many lines as possible, without necessarily adding value and potentially with a negative effect on the quality of the code base. Or, as described by Jones [Jon94]: "It is hazardous to use a metric that gets worse as real economic productivity gets better." A measure which more closely represents the embedded functionality would not suffer from this issue.

1.2 Research Questions

Following the discussion in the previous sections, we can formulate four research questions, listed below. Before we attempt to determine the expressiveness of languages and the verbosity of specific projects, we must first develop a methodology for estimating information content (see chapter 2).

RQ1: What is a suitable method for estimating information content in source code?

RQ2: Can we distinguish expressiveness levels of programming languages based on information content estimates?

RQ3: Can lines of code measurements be normalised using information content estimates, so that the result is meaningful to the programmer?

RQ4: What is the relation between the verbosity of a system and its quality (i.e. is a project that is written in a verbose style also of poorer quality, and vice versa)?

1.3 Software Improvement Group

The study is hosted by Software Improvement Group (SIG), which is a management consultancy firm that focuses on software-related challenges. This environment allows us to work on real-life projects and employ methods that build on top of their existing static code analysis tools. A core aspect of SIG’s business is based on their maintainability model. The size of a system is also a parameter in this model, as typically the larger a system, the more effort required to maintain it. SIG has its own table for normalising the LOC count, which is based on existing tables and expert opinion.

1.4 Outline

In chapter 2 we introduce the background on software sizing, expressiveness of programming languages, information content and compression, while in chapter 3 we outline the research method. Then in chapters 4–6 we answer the research questions consecutively. The results are then compared to related work in chapter 7 and finally, we present our conclusion and suggestions for future work in chapter 8.

Chapter 2

Background

This chapter provides the background and context necessary to comprehend our research. The chapter presents information on software sizing, expressiveness of programming languages, Kolmogorov complexity, information content and compression.

2.1 Software Sizing

Software sizing is an activity in software engineering that is used to determine or estimate the size of a software application or component, in order to support other software project management activities, for example predicting the amount of source code and other deliverables that must be built to satisfy the requirements of a system. As stated by Pressman and Maxim: "When comprehensive software metrics are available, estimates can be made with greater assurance, schedules can be established to avoid past difficulties, and overall risk can be reduced" [PM07]. As described by Galorath [GE06], approaches to size estimation can generally be characterized as bottom-up or top-down. Estimation by expert opinion, analogy, or cost model can employ either approach, but decomposition is inherently a top-down method, i.e. starting with the entire software program and decomposing it into smaller pieces. Currently, the two predominant sizing measures are counting lines of code (LOC) and function point analysis.

2.1.1 Function Points

Function points, though frequently criticized [Con90, Jon95], are the only functional sizing method recognized by ISO [Sse12]. Function points are primarily used to estimate the size of a system based on its requirement specification, using a manual estimation process. This involves a mixture of counting (of interfaces and files) and expert judgement (of the complexity of the elements) [AG83]. To make accurate measurements, highly skilled and trained measurement experts are required. The Object Management Group, a non-profit association for computer industry standards, has adopted the Automated Function Point (AFP) specification (http://it-cisq.org/standards/automated-function-points/), which provides a standard for automating Function Point counting. This means that the analysis can be reproduced accurately and is less time-consuming; however, it does not address critical issues, such as the inability to measure highly algorithmic, complex functionality, the need for manual configuration, the need for a detailed system specification, and being a poor indicator of effort [Con90]. During the early stages of a project, it is difficult to estimate the lines of code required. However, Function Points can be derived from requirements and are therefore useful when there is no code yet, in methods such as estimation by proxy. Thus function point counting may still have its place in software estimation. Nevertheless, when determining the size of an already complete system, a more accurate measure that can be measured more consistently should be favoured.


2.1.2 Effort estimation

Size estimations serve as input for effort estimation, although it must be noted that the two are different, because productivity also plays a role in the latter. An inexperienced programmer might have to put in an excess amount of hours to produce the same functionality as a more experienced programmer. As a result, the productivity factor is exceedingly difficult to quantify [GE06].

effort = functionality / productivity

Studies suggest that there is a relation between LOC and effort. However, research on the correlation between function points and lines of code is inconclusive, with studies suggesting both correlation and no correlation [Jon86, LJ90].

2.2 Expressiveness of programming languages

In the context of software engineering, the expressiveness of a language may refer to various meanings. It may refer to expressiveness in a theoretical sense, i.e. to what extent a language is Turing complete [Tel94], meaning the number of ideas that are expressible in a language regardless of ease. Alternatively, it may refer to the succinctness of a language, i.e. how concisely and readably an idea can be expressed. According to Felleisen [Fel91], one language is more expressive than another when it supports more programming constructs than the other. In his work on a formal framework for reasoning about the expressiveness of programming languages, he also states that programming language constructs can roughly be categorised as essential constructs and syntactic sugar. Programming languages that support more essential constructs naturally allow for more ideas to be expressed in that language. However, syntactic sugar, such as list comprehensions, leads to a significant reduction in LOC, and most modern general purpose programming languages are Turing complete [Mic16]. Therefore, the most sensible and practical meaning of expressiveness is that of succinctness, which is the definition we will use throughout this thesis. Furthermore, it captures the differences between the effort that goes into developing functionality and the number of lines of code that are necessary to express this in a language.

Little research has been done on the effects of the expressiveness of programming languages. In their study on programming language adoption, Meyerovich & Rabkin [MR13] find that high expressiveness is one of the most desired properties of a language. On one side of the spectrum, we can clearly see that coding in assembly leads to lower productivity. Raemaekers [Ste15] argues, however, that on the opposite side of the spectrum (e.g. APL [Ive62]), a small amount of code that does too much may also be hard to understand and thus to maintain. He suggests that there exists an optimal expressiveness for programming languages, where these two concerns strike a balance.

Lastly, depending on how a language is used, its expressiveness may vary. For example, XML might be used to store data (redundant structure) or as a domain specific language (more expressive). Additionally, frameworks and libraries can influence how much code is required to implement a certain functionality.

2.3 Kolmogorov Complexity

The Kolmogorov complexity (sometimes also called algorithmic complexity or Turing complexity) of an object, such as a piece of text, is the length of the shortest program (in a predetermined programming language) that produces the object as output [Nan10, LV08]. The integer K(s) of string s gives the length of the shortest compressed binary version from which s can be fully reproduced, essentially perfect compression. As such, Kolmogorov complexity is simultaneously a measure of the computational resources needed to specify the object. The Kolmogorov complexity of an object can be viewed as an absolute and objective quantification of the amount of information in it [CVdW04]. This leads to a theory of absolute information content of individual objects, in contrast to classic information theory, which deals with the average information needed to communicate information produced by a random source (e.g. a data stream) [Ver98].

Even though s may be of infinite size, the information it contains can be very little [Sta07]. Thus, Length(s) = ∞ does not imply K(s) = ∞. An example of this is shown in listing 2.1, where the code generates a list of infinite size. However, it contains finite information, as it can be expressed in just a small piece of code (source: https://wiki.haskell.org/index.php?title=Prime_numbers&oldid=36858). This might be unintuitive, as primes typically exhibit few patterns and thus might appear random.

primes :: [Integer]
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
      where (h, ~(_:t)) = span (< p*p) xs

Listing 2.1: Sieve-based lazy primes implementation in Haskell

The pigeonhole principle states that if n items are put into m containers, with n > m, then at least one container must contain more than one item [Her75]. There are 2^n strings of length n, but only 2^n − 1 descriptions shorter than n, so for every length there are inevitably strings for which K(s) ≥ n; otherwise a single program would have to output multiple strings. In the context of Kolmogorov complexity this means that such a string is truly random, and its Kolmogorov complexity equals the length of the original string plus a print statement [Nan10]. In theory this applies to a large portion of all strings; in practice, however, most strings contain some form of redundancy which can be compressed effectively.

2.3.1 Incomputability and Estimation

Kolmogorov complexity is not computable, which stems from the fact that we cannot compute the output of every program [Nan10]. Fundamentally, there is no algorithm capable of predicting for every program whether it will ever halt, as shown by Alan Turing [Tur36]. Therefore, even if we have a short program that outputs the string and that seems to be a good candidate for being the shortest such program, there is no way of verifying that no shorter version exists. Instead, Kolmogorov complexity is typically estimated using compression. Though probably not the shortest version in many cases, a compression algorithm essentially produces a much shorter program that can be used to output the original string again [ACO05, CVdW04]. Thus K(s) ≈ S_c(s), where S_c is the compressed size of the object.
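As a concrete illustration of this estimation, the minimal sketch below (our own Python illustration with made-up inputs, not part of the thesis tooling) approximates K(s) by the LZMA-compressed size of a string. A redundant string compresses to a small fraction of its length, while a random string barely compresses at all:

import lzma
import os

def estimated_information_content(data: bytes) -> int:
    # K(s) is approximated by the size of a losslessly compressed version of s
    return len(lzma.compress(data))

redundant = b"abcabc" * 1000       # highly repetitive: little information
random_like = os.urandom(6000)     # random bytes: nearly incompressible

for label, s in (("redundant", redundant), ("random", random_like)):
    k = estimated_information_content(s)
    print(f"{label}: length={len(s)} estimate={k} ratio={len(s) / k:.1f}")
# the redundant string compresses far below its length,
# while the random string stays close to its original size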

2.3.2 Applications

Kolmogorov complexity finds many applications in algorithmic information theory, where it helps prove important results such as the incomputability of Turing's halting problem and Chaitin's constant. Furthermore, it is used in the theory of universal induction and serves as a theoretical lower bound for lossless data compression [LV08]. A more practical application was proposed by Cilibrasi & Vitanyi [CV05] and Li et al. [LCL+04]: clustering by compression. Their clustering methodology is based on the notion that, when two objects are similar, their total amount of information is less than when they are different. Thus, when compressing similar objects together, they can be compressed more effectively than when they are not similar. The concept of clustering is not specific to an application area and can be used to classify various things, such as languages or music.

2.4 Data compression

Data compression is a process by which the file size is reduced by encoding the file data to use fewer bits of storage than the original file [MMM12]. The original file can then be recreated from the compressed representation using a reverse process called decompression.


2.4.1 Compression Ratio

Data compression ratio, also known as compression power or simply compression ratio, is a term used to quantify the reduction in data-representation size produced by a data compression algorithm. A higher compression ratio implies more redundant file contents and thus less information per character [CV05]. However, different algorithms result in different compression ratios. Specialised algorithms might produce very high compression ratios at the cost of being very resource intensive. The compression ratio CR of a file or archive is defined as its uncompressed size S_u divided by its compressed size S_c [MV16]:

CR = S_u / S_c

2.4.2 Lossless & Lossy

There are several algorithms and implementations that are used for the compression of files, some of which perform better than others for certain kinds of file types [Mah12]. The algorithm of choice depends on the requirements, the file type and the availability of the software on the system. We can categorise these algorithms into two types of compression, lossy and lossless:

1) Lossless compression is based on information theory [Ver98] and reduces bits by identifying and eliminating statistical redundancy. It ensures that the original file can be recreated entirely, meaning there is no data loss during compression and decompression. This type of compression is usually a must for compressing text files (such as source code), data files or other formats that should remain unchanged after the process [PK10]. From the pigeonhole principle [Her75], we can infer that no lossless compression algorithm can efficiently compress all possible data. In practice, however, text and other data are very redundant, and compression algorithms are therefore designed either with a specific type of input data in mind or with specific assumptions about what kinds of redundancy the uncompressed data are likely to contain.

2) Lossy compression is based on rate-distortion theory, the branch of information theory [Ver98] that treats compressing the data produced by an information source down to a specified encoding rate that is strictly less than the source's entropy [Tob03]. This means that the compression is an irreversible process, because the data encoding methods that are used are inexact approximations and partially discard data to represent the content more concisely [MMM12]. Depending on the usage of the files, a certain amount of loss in quality might be acceptable to save space and resources. As a result, lossy compression is mainly used for media files such as image, audio and video files, for which a loss of quality is typically acceptable for practical purposes.

2.4.3 Archives

Files are often stored in an archive, which is a file composed of one or more files, often including some metadata [Mas06]. Two frequently used archive formats are ZIP (https://www.iana.org/assignments/media-types/application/zip) and tar (https://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5&manpath=FreeBSD+7.0-RELEASE). The main difference between these archives is how compression is applied to the files.

In a ZIP archive each file is stored separately, allowing different files in the same archive to be compressed using different methods. Compressing the files individually allows for random-access processing; it is possible to extract them, or add new ones, without applying compression or decompression to the entire archive. This is convenient when individual parts of an archive have to be read or updated frequently. A directory is placed at the end of a ZIP file which identifies what files are in the ZIP and where in the archive each file is located. This allows ZIP readers to load the list of files without reading the entire ZIP archive. This format, however, adds a significant amount of bytes of overhead which is not compressed [ISO15].

In comparison, in a tar archive files are essentially concatenated. Additionally, each file is preceded by a 512-byte header record and padded with zeros so that the file length is rounded up to a multiple of 512 bytes. After compression, the tar format does not allow random-access processing. For large tar archives this reduces performance, making tar archives unsuitable for situations that often require random access to individual files. However, compressing tar files typically leads to higher compression ratios, because the entire archive, including headers and padding, is fed to the compressor as a single entity. This way, the statistical model is not based on only a single file and can therefore remove redundancies between files. Consider two files with identical content in the same archive. If compressed individually, the files would both yield roughly the same compression ratio. Alternatively, by treating the archive as a single entity, we could apply compression to the first file and for the second file simply refer to the first as a whole, leading to a much better compression ratio. To summarize, the sizes after compression can be vastly different, depending on what type of archive is used. In uncompressed state, and ignoring the overhead that the archive adds, the archive size equals the sum of the file sizes:

S_u(A) = Σ_f S_u(f)

Nevertheless, this does not translate to the compressed state:

S_c(A) ≠ Σ_f S_c(f)
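The difference can be demonstrated with a minimal sketch (our own illustration with made-up file contents): ten identical files compressed as one LZMA-compressed tar stream versus a ZIP archive whose members are compressed individually.

import io
import lzma
import tarfile
import zipfile

# ten identical source files: lots of redundancy *between* files
files = {f"copy_{i}.py": b"def process(data):\n    return sorted(data)\n" * 50
         for i in range(10)}

# tar: concatenate all files, then compress the archive as a single entity
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
tar_xz_size = len(lzma.compress(tar_buf.getvalue()))

# ZIP: every member is compressed on its own
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_LZMA) as zf:
    for name, data in files.items():
        zf.writestr(name, data)
zip_size = len(zip_buf.getvalue())

total = sum(len(d) for d in files.values())
print(f"uncompressed: {total}, tar.xz: {tar_xz_size}, zip: {zip_size}")
# cross-file duplication is removed in the tar.xz but repeated in the ZIP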

Chapter 3

Research Method

This chapter describes the methods used to answer the research questions as formulated in section 1.2. The questions are answered by means of empirical research and statistical analysis. In figure 3.1 a schematic overview of the research is shown.

Figure 3.1: Schematic overview of the research.

3.1 Methodology

The study aims at measuring information content in source code, as we expect this to be a better indication of the intellectual effort that went into its creation. Section 2.3.1 explains that the information content of an object cannot be directly measured. Instead, various sources [CVdW04, LCL+04] suggest that information content can be estimated using a real-world compressor. Unfortunately, we are left with many questions regarding the exact methodology, such as the choice of projects, algorithms, archives and other filters and constraints. The exploratory part [ESSD08] of this question is already answered by other studies [ACO05, CV05], but the specific parameters for our purpose have to be investigated further. These questions are best answered by means of controlled experiments, whereby one or more independent variables are manipulated to measure their effect on a dependent variable [ESSD08]. In our case we investigate whether the compression ratio of an object depends on the compression algorithm, the archive and other parameters. Next, we investigate whether the information content estimates are a viable measure for determining the expressiveness of programming languages and whether differences in programming style can predict the quality of a project. Again, we are investigating how various variables are related, and thus a controlled experiment is a good fit for our research.

3.2 Data

From the research questions mentioned in section 1.2, we can directly derive criteria for the projects that we analyse. The analysis is not limited to projects of higher quality. Selecting only these projects

would make the results less representative of the population, would make the normalisation of LOC measurements less meaningful, and would prohibit answering RQ4. The criteria are listed below:

1) Since we aim at measuring the functional size of a system, we are only interested in source code that actually contributes functionality to the system. Furthermore, code that requires no real effort from the programmer must also be excluded. Together this excludes test code, libraries and generated code, which are typically excluded from static code analysis as well. Since no open source benchmark repositories exist that are scoped this way, and generic scoping (e.g. using a generic regular expression) would be too unreliable, we make use of projects from SIG's software analysis warehouse, containing a mixture of mostly customers' projects and open source projects (used for calibrations). These projects are typically scoped by SIG consultants and validated by customers.

2) The projects in the dataset must have some variety in size in terms of lines of code, to exclude the possibility that this might influence the result.

3) The dataset should exclusively consist of engineering projects. In software engineering research, educational and other types of projects are considered noise which could skew the study and may lead to unrealistic, potentially inaccurate, conclusions [MKCN17].

4) Older projects might have been designed with very different constraints than are common today (e.g. memory constraints), or may have been written in a significantly older version of a language. Between one version and its successor the difference in expressiveness is probably marginal; however, this may not hold for a version that is thirty years older (e.g. C++, 1985–2017). Therefore, we only analyse projects that are in active development or were within the last five years, as this covers at most three versions for most languages.

3.3 Counting Lines of Code

There are a number of different ways to measure lines of code, including, but not limited to, regular lines of code (LOC), source lines of code (SLOC), source statements (SS), logical lines of code (LLOC) and effective lines of code (eLOC) [SBT14]. When the LOC count is used as an exact measure of functionality, using just a regular LOC count on different languages could skew the result. For example, when comparing C (abundance of curly brackets on a single line) and Python (only indentation), a regular LOC count will be difficult to compare. In such a scenario, opting for a measure such as SLOC or LLOC is sensible. Our research aims to determine expressiveness levels for various languages, with one of its intended applications being the normalisation of LOC counts. Therefore, we select the most basic version of counting, LOC. Brackets, whitespace and other boilerplate code are all inherent properties of the language. In theory, using a different measure for our research should yield the same result, as long as the same metric is used for the counts that are normalised. Using a more complicated definition can thus be considered adding unnecessary complexity. Additionally, we are interested in the intellectual effort that goes into writing a certain amount of lines. Thus, we only analyse the actual code, meaning files are cleaned from comments and excess whitespace. However, comments can be seen as a qualitative attribute; therefore we maintain two versions of the files, one with comments and one without. A fitting tool for our purposes is cloc (Count Lines of Code, version 1.76, https://github.com/AlDanial/cloc); it allows us to do a regular LOC count, can strip files of comments and excess whitespace, and it is open source.
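To make the difference between counting variants concrete, the sketch below (a naive illustration of our own, not the cloc implementation; it only recognises full-line '#'-style comments) contrasts a raw LOC count with a count after stripping blank lines and comments:

def loc_counts(source: str, comment_prefix: str = "#") -> tuple[int, int]:
    """Return (raw line count, count without blanks and full-line comments)."""
    lines = source.splitlines()
    cleaned = [ln for ln in lines
               if ln.strip() and not ln.strip().startswith(comment_prefix)]
    return len(lines), len(cleaned)

example = "#!/usr/bin/env python\n\n# greet the user\nprint('Hello, world!')\n"
print(loc_counts(example))  # (4, 1): the shebang, blank line and comment
                            # are all dropped by this naive filter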

3.4 The expressiveness spectrum

Initially it might seem a good idea to compare our results to some of the existing language level tables. However, this would bring us no closer to an actual validation of the measurements.


First of all, we would not know which source to use, as the tables do not correlate among themselves. Also, any differences that we observe could never be attributed with certainty to an incorrect measurement on either side. What makes validation exceedingly difficult is that the bulk of the projects that are available and meet our requirements fall within the middle segment of the expressiveness spectrum (see figure B.2), making potentially small variations appear like an inverse correlation. Instead, we base our validation on commit size analysis [Ber13] and the opinion of over 300,000 programmers [Mac16]. One could hardly call COBOL or Assembly a highly expressive programming language. Similarly, programmers agree that most general purpose languages, such as Java or C#, lie within the middle segment, and that the highly expressive side is occupied by functional languages, like Haskell, Erlang and Scala. As long as we see these segments matched in the data, we can infer that our results are a good indication of the "true expressiveness".

Figure 3.2: The expressiveness spectrum.

Chapter 4

Measuring information content

This chapter aims to find a reliable method for measuring information content in source code. As described in section 2.3.1, the information content or Kolmogorov complexity of an object cannot be measured, but must be estimated using compression. As a result, the software and methodology that we use influence the result. In particular, the size of the project, compression algorithm and type of archiving that we use can have a significant effect on the effectiveness of the compression.

4.1 Compressor and Algorithm selection

Raemaekers [Ste15] explains that, conceptually, all compression algorithms are capable of removing duplication or redundant information in code and can reduce a system closer to its net information content. Some algorithms can reach significantly higher compression ratios than others. Better compression means a closer approximation of the true information content of the object, and thus instinctively one would opt for the algorithm which reaches the highest compression ratios. This assumption does not hold, because while theoretically all items might be compressed better by another algorithm, the improvement might not be the same for all items, and thus we move away from the asymptotic value [CVdW04]. This is especially true for lossy compressors, where some information is lost after compression. Instead we should opt for an algorithm which is less specialized and has decent performance in varying circumstances. Research suggests that most general purpose compressors are suitable for our purpose [CV05, ACO05]. In these studies the algorithms mentioned in table 4.1 are proposed as candidates. For the final selection of the compressor, the criteria listed below were taken into consideration.

Table 4.1: Candidate compressors

Algorithm   Implementation   Version   Source
BWT         bzip2            1.0.6     http://www.bzip.org/
LZ77        gzip             1.9       http://www.gzip.org/
LZMA2       XZ Utils         5.2.4     https://tukaani.org/xz/
PPMd        7-Zip            16.02     https://7-zip.org/

(1) Reproducibility and noise: the use of block and window sizes in compressors aims to increase computation speed at the expense of the compression ratio. Across a block boundary, no repetition is detected. For smaller objects this is of no real concern; however, for a system with more than ±100 KLOC the compressor's performance suffers and is sensitive to the order of input [ACO05]. Based on this we can directly exclude bzip2 and gzip, as they may skew the result for larger projects.

(2) Ability to distinguish between languages and lower variance between data points within a language. The latter could be interesting at a later stage, when compression may be used to analyse a specific project compared to other projects in that language, rather than a language as a whole.

(3) Lastly, preference is given to an algorithm which is not overly resource intensive (some compressors use over 30GB of memory, or take more than four days to compress a 1GB file), as this would reduce the number of data points that we can obtain.

4.1.1 Comparing algorithms

To gain insight into the effect of the compressor on the result, we have compressed several open source projects in C#, Java and Python. For each pair of compressors we calculate the differences in compression ratios for each project by subtracting the results. The distribution of these differences is shown in figure 4.1. We can see that all general purpose compressors perform quite similarly. The results are almost exactly proportional for most algorithm pairs. Between 7-Zip and XZ the bulk of the projects are almost exactly proportional, with the largest differences ranging from 0.18 to 1.15 percent. We can conclude that our measurements will fluctuate at most 1% by choice of either algorithm. Reviewing gzip and bzip2, we observe some outliers compared to the other compressors and among each other. These are all larger projects, confirming Alfonseca's findings. We found that XZ performed slightly better in compression time than 7-Zip. This is not in line with some compression benchmarks (e.g. http://mattmahoney.net/dc/text.html), possibly because 7-Zip is designed for Windows and only a port exists for Unix-like systems. Thus, for performance reasons, along with widespread use on most Unix-like operating systems, XZ Utils is the most sensible compressor for our purposes, though both PPMd and LZMA based compressors could be used.

Figure 4.1: Differences in results between the compressors.
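Such a comparison can be reproduced with the general purpose compressors available in the Python standard library (a sketch of our own; PPMd/7-Zip has no standard library binding and is omitted, and "project.tar" is a placeholder for an archived project):

import bz2
import gzip
import lzma

def compression_ratios(data: bytes) -> dict[str, float]:
    # compression ratio = uncompressed size / compressed size
    return {
        "gzip (LZ77)": len(data) / len(gzip.compress(data)),
        "bzip2 (BWT)": len(data) / len(bz2.compress(data)),
        "xz (LZMA2)": len(data) / len(lzma.compress(data)),
    }

with open("project.tar", "rb") as fh:  # hypothetical project archive
    data = fh.read()
for name, cr in sorted(compression_ratios(data).items(), key=lambda kv: kv[1]):
    print(f"{name}: {cr:.2f}")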


4.2 Archive selection

As described in chapter 2, there are essentially two ways of storing and compressing files in an archive: (1) storing and compressing each file separately, as in the ZIP format, or (2) concatenating all files into an archive and then compressing the archive as a whole, as with tar. The first option does not allow redundancies between files to be compressed efficiently, whereas the second does. As a result, the sizes after compression (and thus the compression ratios) can be vastly different when using a different archive. Which archive, then, is most suited for measuring information content in source code? We can further divide this question into two concerns:

1) The pigeonhole principle explains that every compression algorithm is going to increase the size of some input. File compression works by removing redundancy, and thus files that contain little redundancy compress badly or not at all. Many strings, even at larger lengths, are truly random (i.e. strings that cannot be compressed); however, man-made objects such as text or source code are much more likely to contain redundancies. This raises the question whether a relation exists between the size of a project and its compression ratio. If so, compressing a project as a whole provides us with no new information about its information content. It would only tell us that there is perhaps more information than in a project with fewer lines and vice versa, but it tells us nothing about the density of the information.

2) When compressing on a per-file basis, redundant structures across files are not compressed. These structures might be necessary to build a certain functionality, because the language does not offer more concise constructs. Thus, we would not be measuring the expressiveness of the language, but rather how often redundant structures are found within a single file.

In either case the style of the programmer has some influence on the result: with the ZIP archive, a programmer who splits his code into many small files will cause a much lower compression ratio, while with the tar archive, a programmer who produces lots of duplicated code will cause a much higher compression ratio.

4.2.1 Overhead

Both archives add some overhead to the file size in the form of lookup tables, headers and file padding. Before we can determine which type of archive is suitable for our purpose, we first need to consider the effects of overhead and how we calculate the compression ratio. When compressing a ZIP archive, the overhead is not compressed with the rest of the files. This means that we can eliminate the effect of overhead on the result: we simply calculate the size of the overhead by taking the difference between the archived and unarchived size of the files. The overhead can then directly be subtracted from the compressed size. For a tar archive this is not possible, as the bytes added as padding and headers are compressed along with the files. Figure 4.2 shows the difference in results between these two ways of calculating the compression ratio. The overhead that is added by a tar archive is very redundant (blocks in the archive are padded out to full length with zeros) and can therefore be compressed very well, leading to a higher compression ratio. The outliers in the figure are archives which contain many small files, leading to a relatively high amount of overhead. Therefore, on the uncompressed side of the equation we will use the sum of the file sizes rather than the archive size. In uncompressed state, an archive is on average 4 percent larger than the sum of its files. Determining exactly how many bytes are added to the compressed state is not possible, but it is unlikely to have a significant effect: in a compressed state the overhead contributes very little to the overall archive size, as it is a highly redundant part of it.

Figure 4.2: Difference in results from the calculations. The red line represents an overhead-less archive.
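The resulting calculation can be sketched as follows (our own illustration; it assumes the project directory has already been scoped and cleaned): the compressed size comes from the xz-compressed tar stream, while the uncompressed side is the plain sum of the file sizes, so tar headers and padding never enter the ratio.

import io
import lzma
import os
import tarfile

def compression_ratio(project_dir: str) -> float:
    paths = [os.path.join(root, name)
             for root, _, names in os.walk(project_dir) for name in names]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for p in paths:
            tar.add(p)
    compressed = len(lzma.compress(buf.getvalue()))
    # uncompressed side: sum of the file sizes, NOT the (padded) tar size,
    # so the archive overhead does not inflate the ratio
    uncompressed = sum(os.path.getsize(p) for p in paths)
    return uncompressed / compressed

print(compression_ratio("my-project/src"))  # hypothetical path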

4.2.2 Comparing archives

In theory, the compression ratio is not a reflection of a project's size, but rather of its information density. Thus, there should be no relation between an object's size and its compression ratio. Our experiments show that this is not the case for the tar archive. Figure 4.3 shows that a sizeable portion of the smaller projects compress worse than larger ones. Though on the surface its compression ratios appear to be independent of size, the ZIP archive shows similar behaviour: since the files are compressed individually and are much smaller than an entire archive, their size predicts the compression ratio very well. This is reflected by the data, as the average file size of a ZIP archive predicts its compression ratio very well (r = 0.86, p < 0.001).

The first thing we notice when comparing both archives is that the tar archive reaches significantly higher compression ratios than ZIP. This is of course because redundancies between files are not removed when using the ZIP archive. As a result, the ZIP archive is unsuitable, as it does not measure the true information content of a system. Instead, it focuses only on small parts of the system, ignoring architecture, dependencies and other relations within that system. A function that only makes a call to a "Hello, world" function in another file would appear to be of very high information density, whereas in actuality it is a very redundant structure. The tar archive does account for these scenarios; however, it does force us to cut smaller projects from the dataset.

Figure 4.3: Relation between lines of code and compression ratio.

4.3 Project size

Based on our analysis in section 4.2 we can conclude that smaller projects have to be cut from the dataset. The smaller projects are compressed significantly worse and also display higher variance. The data shows that the larger a project, the less apparent this effect is. But at what point exactly is a project too small to be analysed for information content? To determine the point at which the size of a project has an insignificant influence on the compression ratio, we continually remove increasingly larger projects from our dataset until the Spearman correlation coefficient indicates a negligible correlation (r < 0.30). The complete results are included in Appendix A. Around 10 KLOC the correlation coefficient falls below this threshold, indicating that smaller projects are not suitable for analysis. The results are visualised in figure 4.4.

Figure 4.4: Relation between lines of code and compression ratio. The red line indicates the cut that we have made.
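The cut-off procedure can be sketched as follows (our own illustration using scipy; the threshold r < 0.30 follows the text, while the step size and data format are assumptions):

from scipy.stats import spearmanr

def find_cutoff(points, start=1_000, step=1_000, threshold=0.30):
    """points: (loc, compression_ratio) pairs, one per project."""
    cutoff = start
    while True:
        kept = [(loc, cr) for loc, cr in points if loc >= cutoff]
        if len(kept) < 10:       # too little data left to decide
            return None
        r, p = spearmanr([loc for loc, _ in kept],
                         [cr for _, cr in kept])
        if abs(r) < threshold:   # negligible correlation: size no longer
            return cutoff, r, p  # predicts the compression ratio
        cutoff += step           # drop the next slice of small projects

# with the thesis dataset this procedure settles around 10 KLOC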

4.4 Discussion

We can never know how closely compressed file size approximates the information content of that file. Thus, we can never know how accurate our measurements really are in an absolute sense, nor how suitable a compressor really is for estimating information content, other than by comparing the output to known values, or by reasoning about unexpected outliers. An exact measure is not necessary for our purposes, however, because we are using it as input for relative calculations. Thus, it can still be used to find valuable information about the expressiveness of programming languages. We know from the literature that the best information content estimates come from compressors with no knowledge about source code. Unfortunately it remains unclear how exactly source code is compressed, since the output is an incomprehensible binary file. This makes it practically impossible to find out what the underlying reasons for a certain compression ratio are.

The effect of scoping out certain files on the compression ratio of a project has not been investigated, and the possibility of using generic scoping is left unexplored. Of course, two versions of the same library slipping through the cracks would lead to a distorted compression ratio, but such a case would almost certainly be extreme enough to be considered an outlier. Even if this were not the case, we would still have no reason to believe the effects would not be normalised across all languages.

ZIP and tar archives are both old and designed with very different purposes in mind than ours. Perhaps avoiding the use of these tools, for example by concatenating the files, would lead to different results. Alternatively, a very lightweight archive, adding minimal overhead, could have been developed specifically for estimating information content. Regardless, we have shown that the effects of overhead on the compression ratio are marginal and the same for every programming language.

We can reduce the effect of several noise factors by only analysing projects larger than 10 KLOC. The method that we used to determine the exact cut-off has, to the best of our knowledge, no theoretical ground, and the result could be skewed by factors that we overlooked. However, we can be certain that beyond this point LOC and compression ratio can be considered independent variables.

One factor that we did consider was whether setting an upper bound as well as a lower bound would result in a higher correlation, since we are ignoring larger projects that show no correlation. The effect was marginal; for example, the subset from 10 KLOC to 15 KLOC shows no correlation. Lastly, it is also possible that the compression ratio ramp-up differs for languages with different expressiveness, meaning we could still infer expressiveness levels from smaller projects. A requirement for this approach is a dataset that contains projects of all sizes and is also large enough for a curve to be fitted precisely. This is necessary because the distribution of compression ratios would be skewed by the number of data points of a particular size. Though we briefly experimented with this concept, there was insufficient data available for a meaningful answer. Additional research is required to determine whether or not this is a valid approach.

4.5 Conclusion

In this chapter, we have reviewed several parameters that influence the compression ratio of an object, with the intent of establishing a methodology for estimating information content in source code. Based on various experiments, we conclude that the most suitable approach consists of compressing source code in a tar archive, using an LZMA based compressor such as XZ. Other general purpose algorithms are significantly more resource intensive, or are sensitive to the order of input when compressing larger projects. The tar archive allows the removal of redundancies between files, rather than only within a single file. This makes it more suitable for determining the exact characteristics of programming languages. Calculating the compression ratio of an object (when using a tar archive) is best done using the sum of the file sizes rather than the archive size, minimizing the effects of overhead. Finally, we have determined that only projects larger than 10 KLOC can be analysed. These findings form the foundation for the rest of our research, which is described in chapters 5 and 6.

Chapter 5

Expressiveness and information content

This chapter describes how we applied the methodology (see chapter 4) that we developed for estimating information content in source code to a large dataset. Based on the notion that information content can be used as a measure of functionality, we aim to quantify the expressiveness of various programming languages. As a practical application of these language levels, we propose a new method for normalising LOC counts.

5.1 Determining language expressiveness levels

With the exact parameters for compression and archiving in hand, we have analysed over 800 projects. All unwanted files are scoped out and the remaining ones are freed from comments and excess whitespace. Treating all languages within a project as separate data points and filtering out archives smaller than 10 KLOC, we are left with over 1300 data points. Since we had insufficient data on the high end of the spectrum, we have gathered additional data from 45 and 50 projects in Erlang and Haskell respectively. These projects are scoped using a generic scope. This means that we have less confidence in the exact expressiveness values for these languages; however, we know from some of our experimentation that libraries and test code (which we may miss using generic scoping) drive the compression ratio up, rather than down. As a result, the remaining values can still be used to verify the trend, but are less useful for precise normalisation. In table 5.1, the results of the analysis are shown. A visual test (an example is shown in figure 5.1; all distributions are included in appendix B) reveals that the compression ratios for a language are roughly normally distributed. We use the median as the nominal value for each language, as most distributions are right skewed. Some languages have a rather high standard deviation, namely XML, XSLT and XSD. This is most likely because they can be used both in a more expressive way, as a domain specific language, and in a very verbose way, as data storage.

Figure 5.1: Distribution of compression ratios for Java projects.

Table 5.1: Language expressiveness based on compression ratios.

Language      Median(CR)   Mean(CR)   SD(CR)   Data points
Haskell*      5.80         6.47       1.54     50
Erlang*       6.91         7.74       2.40     45
Python        7.18         8.81       4.10     30
JavaScript    7.37         8.77       4.71     136
TypeScript    8.72         8.96       2.08     53
Groovy        9.08         9.16       1.56     40
Pascal        9.13         9.82       2.28     17
PHP           9.22         9.46       1.38     14
Java          9.38         9.62       3.00     285
SQL           10.5         11.6       5.27     68
HTML          10.9         12.4       6.25     74
CSS           11.1         12.7       5.27     76
JSP           11.4         13.2       5.16     31
Razor         11.7         11.5       2.64     26
C#            12.4         13.3       4.17     183
VisualBasic   12.1         12.3       3.37     18
ASP.NET       12.7         13.5       9.12     24
Smalltalk     14.2         15.8       5.83     31
COBOL         18.3         19.0       7.80     28
XML           18.6         16.8       11.9     101
XSLT          21.8         20.6       10.8     36
XSD           23.3         20.4       9.20     60

*Only scoped using a regular expression.

Figure 5.2: Distributions of the compression ratios of various languages, sorted by median compression ratio.

5.2 Validation

Though we discarded the idea of validation by comparing compression ratios to existing language level tables in section 3.4, we still decided to compare them. The language level tables (SPR, SAT and QSM) contain Source Statements per Function Point values for each language. The lower this number, the fewer statements are necessary to express a function point, and thus the more expressive a language is. We found no correlation between the various language level tables. Though still not significant, the best fit is shown in figure 5.3, which results from averaging the three tables. Instead, we compare the results to the combined opinion of over 300,000 programmers [Mac16] and expressiveness estimates based on commit size analysis [Ber13]. Here the results match our expectation, as we observe notable differences between the languages and see the three segments that we described in section 3.4 reflected. Ignoring XML, XSLT and XSD, we see COBOL on the low expressiveness side. The middle segment is mostly populated by general purpose languages, such as Java and C#. Finally, functional languages like Haskell and Erlang are on the high expressiveness side.

Figure 5.3: Compression ratios plotted against the average of the language levels (r = 0.40, p = 0.087).

5.3 Normalising LOC counts

Now that we have determined the expressiveness levels of various languages, what can we do with them? Existing language level tables can be used to convert a LOC count to a number of function points, which in turn corresponds to a number of man-months. However, our values are not bound to such a real-life counterpart.

The most simple application is to compare a number of lines written in one language to a corresponding number of lines written in another. For example, say we have two systems: one 100 LOC (Median(CR) of 9.38) system written in Java and the other 130 LOC written in C# (Median(CR) of 12.36):

100 LOC / 9.38 ≈ 130 LOC / 12.36 ≈ 10.6

Here we could say that the amount of functionality of both systems is similar at roughly 10.6 'Units of Functionality', but what exactly is this unit composed of? Furthermore, it is a rather difficult number to comprehend, as it has no real-life representation. Instead, we could convert the line count to another language:

LOC_C# = LOC_Java / Median(CR_Java) * Median(CR_C#)
       = 100 LOC / 9.38 * 12.36 ≈ 130 LOC

However, we are still lacking a general way of defining the functionality of source code. As a solution to this problem, we propose the normalised lines of code (NLOC) metric. Hereby we convert the LOC count to the mean of the median compression ratios of the languages that we have available (we use the mean rather than the median of the compression ratios, because we believe that every language influences this reference point):

CR_Reference = Mean(Median CR of all languages)

We argue this number more closely represents the programmer's internal reference point for LOC counts. We have determined this reference point to be 11.01. NLOC provides a uniform way of defining the size of a software system which can be applied to all programming languages. In practice, the conversion of a LOC count to NLOC would be very similar to the example above:

NLOC = LOC_Java / Median(CR_Java) * CR_Reference
     = 100 LOC / 9.38 * 11.01 ≈ 117 NLOC

similarly, for C#:

NLOC = LOC_C# / Median(CR_C#) * CR_Reference
     = 130 LOC / 12.4 * 11.01 ≈ 115 NLOC
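A minimal sketch of this conversion (median ratios taken from table 5.1, the reference point 11.01 from the text above):

# median compression ratios from table 5.1 (subset)
MEDIAN_CR = {"Java": 9.38, "C#": 12.4, "Python": 7.18, "Haskell": 5.80}
CR_REFERENCE = 11.01  # mean of the median CRs of all analysed languages

def nloc(loc: int, language: str) -> float:
    # scale the raw count by the language's expressiveness
    # relative to the reference point
    return loc / MEDIAN_CR[language] * CR_REFERENCE

print(round(nloc(100, "Java")))  # ~117 NLOC
print(round(nloc(130, "C#")))    # ~115 NLOC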

5.4 Discussion

Though a high correlation with the language level tables is not necessarily desirable, the absence of any significant correlation means that we have no general validation of our approach other than intuition. This becomes more problematic if we consider that the compression ratios may not be linear in the actual expressiveness of a language. And there is no way to determine the linearity either, other than perhaps a qualitative analysis of output files, to determine exactly how source code is compressed and how this relates to a formal definition of expressiveness. This means that even if our results are a good indication of expressiveness, results from calculating NLOC could be skewed.

Also, the reference compression ratio for NLOC is based on the languages that we have available, which may not necessarily be a good representation of what a programmer works with most frequently, or spends the most time working with. However, the projects that we used were mostly derived from SIG's database of benchmarking projects, which consists entirely of engineering projects and is updated yearly. In addition, a weighted average could be used, whereby languages with many projects weigh heavier than languages with fewer projects. Alternatively, we could simply opt for the five most used languages in the current market. In either case, the reference compression ratio has to be recalculated periodically. Furthermore, the standard deviation of most languages is quite low, but not so low that there is no overlap between languages. A concisely written Java project is easily as dense as a somewhat verbose Python project. This begs the question whether it is relevant to determine the exact expressiveness of the languages in the middle segment, which includes most general purpose languages. Lastly, it is doubtful that NLOC will be adopted as a sizing measure by hobbyist programmers; however, for medium to large scale projects, the difference between a number of lines in Python and Java matters.

5.5 Conclusion

In this chapter, we have determined expressiveness levels of various programming languages using compression. The results do not significantly correlate with existing language level tables, although they are in line with the intuition of many programmers: for example, Haskell (CR = 5.80) on the highly expressive side, Java (CR = 9.38) in the middle and COBOL (CR = 18.3) on the low-expressiveness side. This gives reason to believe that compression is a viable method to determine expressiveness levels of programming languages and that it can be used to normalise lines of code measurements.

As a practical application, we have proposed the normalised lines of code (NLOC) metric, which is the conversion of a LOC count to a reference expressiveness (the average expressiveness of the available languages). We believe that this reference expressiveness closely represents the programmer's internal reference point for LOC counts. NLOC is a uniform way to define the size of a software system and can be applied to all programming languages.

Chapter 6

Quality and relative verbosity

A lot of redundant or duplicated code is typically viewed as undesirable and as having a negative impact on the quality of the code base. Of course, reducing the code base to a single one-liner has the opposite effect, but extreme cases aside, is there a more general trend? Are projects that are written in a relatively verbose style also of poorer quality/maintainability, and vice versa? Exactly which qualitative attributes are influenced by a verbose style? In this chapter, we investigate the relation between quality and relative verbosity.

6.1 Experiment

In chapter 5, we found that the programming language of choice determines how concise a programmer can be. Here, however, we are not interested in the effect of the programming language, but rather in the effect of stylistic differences. Therefore, we analyse the distribution of compression ratios within individual languages (see figures 6.1 and 6.2). The higher the compression ratio, the lower we expect the score on a particular metric to be; in other words, we expect an inverse relation. The compression ratios are mapped to different metrics of SIG's maintainability model [VBX18], which is a more recent version of Heitlager, Kuipers & Visser's [HKV07] maintainability model. We limit our study to Java and C# projects, since for these languages our dataset is sufficiently large. Because the analysis is limited to the system level, we have only analysed projects that consist of at least 90 percent of code in a single language. Though size may impact the maintainability of a system, we have already established that there is no relation between the size of a project and its compression ratio, and we therefore exclude all size-dependent metrics (Unit Size and Volume). We are left with the following metrics: overall maintainability score, duplication, unit complexity, unit interfacing, module coupling, component independence and component balance.¹

¹A complete description of the metrics is available at: https://www.sig.eu/wp-content/uploads/2018/05/20180509-SIG-TUViT-Evaluation-Criteria-Trusted-Product-Maintainability-Guidance-for-producers.pdf

Figure 6.1: Distribution of compression ratios for Java projects.

Figure 6.2: Distribution of compression ratios for C# projects.

Tables 6.1 and 6.2 show the correlations between the metrics and the compression ratios. We observe a weak relation with duplication, which is no surprise, as duplicated code is of course redundant. Other than this weak relation for both languages, we observe no relation for the other metrics, including the overall maintainability score. Additionally, we have experimented with composite scores, consisting of any combination of the other metrics, but with no significant results. Thus, a more verbose style does not imply that a project is of poor quality.

Table 6.1: Quality relations Java

Metric                   Pearson Corr. Coefficient   p-value
Overall Maintainability  -0.18                       0.029
Duplication              -0.22                       0.005
Unit Complexity          -0.11                       0.19
Unit Interfacing         -0.11                       0.18
Module Coupling          -0.11                       0.18
Component Independence   -0.08                       0.31
Component Balance        -0.03                       0.72

Table 6.2: Quality relations C#

Metric                   Pearson Corr. Coefficient   p-value
Overall Maintainability  -0.29                       0.004
Duplication              -0.40                       <0.001
Unit Complexity          -0.24                       0.013
Unit Interfacing         -0.11                       0.28
Module Coupling          -0.25                       0.013
Component Independence   -0.08                       0.31
Component Balance        -0.12                       0.27
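The correlations in tables 6.1 and 6.2 are plain Pearson correlations; a minimal sketch of the computation follows. The CSV file and column names are hypothetical, since the underlying quality scores come from SIG's proprietary benchmark.

    import pandas as pd
    from scipy.stats import pearsonr

    # Hypothetical input: one row per project, with its compression ratio
    # and its scores on SIG's maintainability model.
    projects = pd.read_csv("java_projects.csv")

    metrics = ["overall_maintainability", "duplication", "unit_complexity",
               "unit_interfacing", "module_coupling",
               "component_independence", "component_balance"]

    for metric in metrics:
        r, p = pearsonr(projects["compression_ratio"], projects[metric])
        print(f"{metric:24s} r = {r:+.2f}  p = {p:.3f}")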

6.2 Discussion

Though we still believe that style and other coding habits inherently influence the quality of source code, this influence is not directly measurable from compression ratios. Why the relation does not show in compression ratios is unclear. Since we only have two languages, which are also quite similar, it is possible that they just happen not to correlate well with these metrics. The inclusion of comments in source code leads to a lower compression ratio (natural language is less redundant than source code) and is often seen as a positive influence on the code's quality. However, we observed no significant difference between the results with or without the inclusion of comments.

6.3 Conclusion

In this chapter, we searched for a relation between stylistic differences and the quality of a project. We found only a weak correlation with one of the metrics (duplication) in the two languages that we analysed. Thus, we conclude that verbosity relative to other projects in a language does not predict other qualitative attributes of the project. There are many possible reasons for this, not least that such a relation simply does not exist. Perhaps further research with other languages and new metrics will show otherwise.

Chapter 7

Related Work

The use of compression as a means to estimate Kolmogorov complexity finds several applications. Our work focuses primarily on defining the exact characteristics of various programming languages and on more accurately determining the functional size of software. Studies have been performed on the use of compression to cluster objects and, more closely related to our work, to calculate software productivity or to determine the complexity of a software system. In this chapter, we briefly cover some of these studies.

7.1 Normalised Compression Distance

Normalised compression distance is a way of measuring the similarity between two objects [CV05, CVdW04, LCL+04]. It has been successfully used to cluster languages, music and text, though theoretically it could be applied to any pair of objects, including source code, software systems, etc. Normalised compression distance is based on Kolmogorov complexity [LV08]: if two similar objects are compressed together, their total information content is much smaller than that of two dissimilar objects. This translates very well to estimates using compression, as the shared and redundant information is compressed effectively, leading to a much higher compression ratio when objects are similar. Research on this topic has paved the way for our approach; among other things, the effects of compression algorithms and other noise factors on estimating information content have been investigated [ACO05].
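As an illustration, the normalised compression distance of [CV05] can be estimated with any real-world compressor standing in for Kolmogorov complexity. A minimal sketch using Python's built-in LZMA bindings (the file names are illustrative):

    import lzma

    def c(data: bytes) -> int:
        """Compressed size, used as an estimate of Kolmogorov complexity."""
        return len(lzma.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        """Normalised compression distance: near 0 for very similar
        objects, around 1 for unrelated ones."""
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    # Illustrative file names.
    a = open("Foo.java", "rb").read()
    b = open("Bar.java", "rb").read()
    print(ncd(a, b))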

7.2 Calculating software productivity with compression

Raemaekers [Ste15] studied the use of compression and information content to express LOC as a certain amount of work, in monetary terms or in time. He compared his results to rebuild value and to estimates of the average number of lines of code written per programmer per hour in a certain programming language, concluding that compression can be used as a software cost estimation method, and he successfully calculated average churn figures for a set of programming languages. Raemaekers' work is oriented more towards a system-level approach, rather than defining the exact properties of programming languages. Furthermore, we have crystallised the methodology necessary for making accurate information content estimates in source code (see chapter 4), investigating and addressing several issues in the experimental design.

7.3 Determining software complexity

In 2016, IBM [Lak16] patented a method for determining software complexity using compression. The patent does not reveal the internals of their implementation, nor what the exact intent of the measurement is, though it refers to cyclomatic complexity, or McCabe complexity [McC76]. McCabe complexity is often criticised because it would correlate with lines of code, though a study by Landman, Serebrenik & Vinju [LSV14] shows that this is only the case on a system or class level, not when applied at the method level. McCabe complexity is a slightly more reasonable measure than lines of code when considering the working memory of humans, and is arguably a more interesting measure in the context of productivity. Why there is a need for a compression-based approximation of McCabe complexity remains unclear: McCabe complexity can easily be counted in a system, by counting certain keywords and constructs in the source code. The only requirement for this is that the language is known beforehand.
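To illustrate how easily it can be counted, consider a minimal sketch for a C-like language; the token set is a simplified assumption, and a real counter would also skip keywords occurring inside strings and comments:

    import re

    # Decision-point tokens for a C-like language (simplified set).
    DECISION_TOKENS = r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?"

    def mccabe(source: str) -> int:
        """Approximate cyclomatic complexity: 1 + number of decision points."""
        return 1 + len(re.findall(DECISION_TOKENS, source))

    print(mccabe("if (a && b) { while (c) { c--; } }"))  # 4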

Chapter 8

Conclusion

In this work, we set out to determine expressiveness levels of various programming languages, in order to measure the size of a system independently of language. Our approach focuses on measuring information content in source code using compression.

Initially, we reviewed several parameters that influence the compression ratio of an object, with the intent of establishing a methodology for estimating information content in source code. Based on various experiments, we conclude that the most suitable approach consists of compressing source code in a tar archive, using an LZMA-based compressor such as XZ. Other general purpose algorithms are significantly more resource intensive, or are sensitive to the order of input when compressing larger projects. Furthermore, we determined that only projects larger than 10 KLOC can be analysed. All of this formed the foundation for the rest of our research.

Next, we successfully determined expressiveness levels of various programming languages. The results do not significantly correlate with existing language level tables, although they are in line with the intuition of many programmers, indicating that compression is an effective way to determine the expressiveness of programming languages. As a practical application, we have proposed the normalised lines of code (NLOC) metric, which is a uniform way to define the size of a software system and can be applied to all programming languages.

Finally, we investigated the relation between various quality metrics of source code and the relative verbosity of a project. We found only a weak correlation with a single metric (duplication), thus concluding that the compression ratio is not a useful metric for determining the quality of a project.
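For reference, the core measurement can be sketched in a few lines under the stated assumptions: tar the source files without compression, compress the archive with LZMA, and take the ratio of the raw to the compressed size. The path and file filter below are illustrative, not part of our tooling.

    import io
    import lzma
    import tarfile
    from pathlib import Path

    def compression_ratio(project_dir: str, suffix: str = ".java") -> float:
        """Size of an uncompressed tar of the source files, divided by
        the size of its LZMA-compressed form."""
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:  # plain tar, no compression
            for path in sorted(Path(project_dir).rglob("*" + suffix)):
                tar.add(str(path), arcname=str(path.relative_to(project_dir)))
        raw = buf.getvalue()
        return len(raw) / len(lzma.compress(raw, preset=9))

    print(compression_ratio("/path/to/project"))  # illustrative path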

8.1 Future work

To provide further research on this topic with a larger and more easily accessible dataset, future studies should prioritise determining the effects of generic scoping on the results. The results could pave the way for further research on the validation of the expressiveness levels and on determining the relation between the compression ratios and the true expressiveness of programming languages. The aforementioned suggestions are all aimed at making data more accessible and solidifying the data that we already have, in order to establish the foundation for studies that map more precise size measurements to effort and productivity.

Lastly, the methodology for measuring information content in source code is not limited to expressiveness levels. It can instead be used in a similar way to other applications of Kolmogorov complexity, specifically clustering. For example, reference architectures could be compressed together with a system to determine what the underlying structure is, or how compliant it is.

Acknowledgements

No great work has ever been accomplished by a single mind. I would thus like to express my gratitude to the Software Improvement Group and Lodewijk Bergmans for giving me the chance to work on this research and for the great supervision I have received. Not to forget my fellow interns, who have provided me with critical feedback and... graphs. Another great thanks goes to my supervisor at the University of Amsterdam, Ana Oprescu, whose comments and feedback have been invaluable. Finally, I would like to thank Luisa Seguin for her suggestions, support and sharp editing eyes.

Bibliography

[ACO05] Manuel Alfonseca, Manuel Cebrián, and Alfonso Ortega. Common Pitfalls Using the Normalized Compression Distance: What to Watch Out for in a Compressor. Communications in Information and Systems, 5(4):367–384, 2005. doi:10.4310/CIS.2005.v5.n4.a1.

[AG83] A.J. Albrecht and J.E. Gaffney. Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation. IEEE Transactions on Software Engineering, SE-9(6):639–648, Nov 1983. doi:10.1109/TSE.1983.235271.

[Ber13] Donnie Berkholz. Programming languages ranked by expressiveness, 2013. URL: https://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/.

[BH91] David Bergantz and Johnette Hassell. Information relationships in Prolog programs: how do programmers comprehend functionality? International Journal of Man-Machine Studies, 35(3):313–328, 1991. doi:10.1016/S0020-7373(05)80131-2.

[Con90] Michael J. Conolley. An Empirical Study of Function Point Analysis Reliability. 1(1):1–186, 1990.

[CV05] Rudi Cilibrasi and P.M.B. Vitányi. Clustering by Compression. IEEE Transactions on Information Theory, 51(4):1523–1545, Apr 2005. doi:10.1109/TIT.2005.844059.

[CVdW04] Rudi Cilibrasi, Paul Vitányi, and Ronald de Wolf. Algorithmic Clustering of Music Based on String Compression. Computer Music Journal, 28(4):49–67, Dec 2004. doi:10.1162/0148926042728449.

[ESSD08] Steve Easterbrook, Janice Singer, Margaret-Anne Storey, and Daniela Damian. Selecting Empirical Methods for Software Engineering Research, pages 285–311. Springer London, London, 2008. doi:10.1007/978-1-84800-044-5_11.

[Fel91] Matthias Felleisen. On the expressive power of programming languages. Science of Computer Programming, 17(1-3):35–75, Dec 1991. doi:10.1016/0167-6423(91)90036-W.

[GE06] Daniel D. Galorath and Michael W. Evans. Software Sizing, Estimation, and Risk Management: When Performance is Measured Performance Improves. Auerbach Publications, Taylor & Francis Group, New York, NY, 1st edition, 2006.

[Her75] Israel N. Herstein. Topics in Algebra. John Wiley & Sons, New York, NY, 2nd edition, 1975.

[HKV07] I. Heitlager, T. Kuipers, and J. Visser. A practical model for measuring maintainability. In 6th International Conference on the Quality of Information and Communications Technology (QUATIC 2007), pages 30–39, Sept 2007. doi:10.1109/QUATIC.2007.8.

[HPM03] Lawrence H. Putnam and Ware Myers. Five Core Metrics: The Intelligence Behind Successful Software Management. Dorset House Publishing, New York, NY, 2003.

[ISO15] Information technology – Document Container File – Part 1: Core. Standard, International Organization for Standardization, Geneva, CH, October 2015. URL: https://www.iso.org/standard/60101.html.

[Ive62] Kenneth E. Iverson. A Programming Language, 1962. URL: http://dl.acm.org/citation.cfm?id=1460872.

[Jon86] Capers Jones. Programming Productivity. McGraw-Hill, Inc., New York, NY, USA, 1986.

[Jon94] Capers Jones. Software metrics: good, bad and missing. Computer, 27(9):98–100, Sept 1994. doi:10.1109/2.312055.

[Jon95] Capers Jones. Backfiring: converting lines of code to function points. Computer, 28(11):87–88, Nov 1995. doi:10.1109/2.471193.

[Lak16] John M. Lake. Determining software complexity, Mar 2016. US Patent 9,299,045 B2. URL: https://patentimages.storage.googleapis.com/a1/d4/f1/e8b77e4e5746c4/US9299045.pdf.

[LCL+04] M. Li, Xin Chen, Xin Li, B. Ma, and P.M.B. Vitányi. The Similarity Metric. IEEE Transactions on Information Theory, 50(12):3250–3264, Dec 2004. doi:10.1109/TIT.2004.838101.

[LJ90] G. Low and D. Jeffery. Function points in the estimation and evaluation of the software process. IEEE Transactions on Software Engineering, 16:64–71, Jan 1990. doi:10.1109/32.44364.

[LSV14] Davy Landman, Alexander Serebrenik, and Jurgen Vinju. Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, ICSME '14, pages 221–230, Washington, DC, USA, 2014. IEEE Computer Society. doi:10.1109/ICSME.2014.44.

[LV08] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science. Springer New York, New York, NY, 3rd edition, 2008. doi:10.1007/978-0-387-49820-1.

[Mac16] David R. MacIver. This language is expressive, 2016. URL: http://hammerprinciple.com:80/therighttool/statements/this-language-is-expressive.

[Mah12] Salauddin Mahmud. An Improved Data Compression Method for General Data. International Journal of Scientific & Engineering Research, 3(3):1–4, 2012.

[Mas06] Julien Masanés. Web Archiving: Issues and Methods, pages 1–53. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. doi:10.1007/978-3-540-46332-0_1.

[McC76] T. J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, SE-2(4):308–320, Dec 1976. doi:10.1109/TSE.1976.233837.

[Mic16] Greg Michaelson. Are there domain specific languages? In Proceedings of the 1st International Workshop on Real World Domain Specific Languages, RWDSL '16, pages 1:1–1:3, New York, NY, USA, 2016. ACM. doi:10.1145/2889420.2892271.

[MKCN17] Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. Curating GitHub for engineered software projects. Empirical Software Engineering, 22(6):3219–3253, Dec 2017. doi:10.1007/s10664-017-9512-6.

[MMM12] Omar Adil Mahdi, Mazin Abed Mohammed, and Ahmed Jasim Mohamed. Implementing a Novel Approach an Convert Audio Compression to Text Coding via Hybrid Technique. IJCSI International Journal of Computer Science Issues, 9(3):53–59, 2012. URL: http://ijcsi.org/papers/IJCSI-9-6-3-53-59.pdf.

[MR13] Leo A. Meyerovich and Ariel S. Rabkin. Empirical analysis of programming language adoption. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA '13), pages 1–18, 2013. doi:10.1145/2509136.2509515.

[MV16] S. Mittal and J. S. Vetter. A survey of architectural approaches for data compression in cache and main memory systems. IEEE Transactions on Parallel and Distributed Systems, 27(5):1524–1536, May 2016. doi:10.1109/TPDS.2015.2435788.

[Nan10] Volker Nannen. A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL). arXiv preprint arXiv:1005.2364, pages 1–23, 2010. URL: http://arxiv.org/abs/1005.2364.

[PK10] Jagadish H. Pujar and Lohit M. Kadlaskar. A new lossless method of image compression and decompression using Huffman coding techniques. Journal of Theoretical and Applied Information Technology, 15(1):18–23, 2010.

[PM07] Roger S. Pressman and Bruce R. Maxim. Software engineering. ACM SIGSOFT Software Engineering Notes, 32(1):4, Jan 2007. doi:10.1145/1226816.1226822.

[SBT14] István Siket, Árpád Beszédes, and John Taylor. Differences in the Definition and Calculation of the LOC Metric in Free Tools. Technical report, University of Szeged, Department of Software Engineering, Szeged, 2014. URL: http://www.inf.u-szeged.hu/~beszedes/research/SED-TR2014-001-LOC.pdf.

[Sse12] ISO/IEC JTC 1/SC 7 Software and systems engineering. Information technology – Software measurement – Functional size measurement – Part 6: Guide for use of ISO/IEC 14143 series and related international standards. International Organization for Standardization, Geneva, Switzerland, 2012. URL: https://www.iso.org/standard/60176.html.

[Sta07] Ludwig Staiger. The Kolmogorov complexity of infinite words. Theoretical Computer Science, 383(2-3):187–199, 2007. doi:10.1016/j.tcs.2007.04.013.

[Ste15] Steven B. A. Raemaekers. Origin, Impact and Cost of Interface Instability. PhD thesis, TU Delft, 2015.

[Tel94] A. Teller. Turing completeness in the language of genetic programming with indexed memory. In Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, pages 136–141 vol. 1, Jun 1994. doi:10.1109/ICEC.1994.350027.

[Tob03] Toby Berger. Rate-Distortion Theory. John Wiley & Sons, 2003. doi:10.1002/0471219282.eot142.

[Tur36] Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, s2-42(1):230–265, 1936. doi:10.1112/plms/s2-42.1.230.

[VBX18] Joost Visser, Dennis Bijlsma, and Haiyun Xu. SIG/TÜViT Evaluation Criteria Trusted Product Maintainability: Guidance for producers. Technical report, Software Improvement Group (SIG) and TÜViT, May 2018. URL: https://www.sig.eu/resources/sig-models/.

[Ver98] S. Verdú. Fifty years of Shannon theory. IEEE Transactions on Information Theory, 44(6):2057–2078, 1998. doi:10.1109/18.720531.

Appendices

Appendix A

Cut-off Data

>KLOC   Spearman Corr. Coefficient   p-value      >KLOC   Spearman Corr. Coefficient   p-value
1       0.48                         <0.001       26      0.11                         0.045
2       0.41                         <0.001       27      0.09                         0.132
3       0.36                         <0.001       28      0.09                         0.118
4       0.33                         <0.001       29      0.07                         0.209
5       0.31                         <0.001       30      0.04                         0.523
6       0.31                         <0.001       31      0.06                         0.292
7       0.29                         <0.001       32      0.07                         0.253
8       0.31                         <0.001       33      0.03                         0.612
9       0.30                         <0.001       34      0.00                         0.948
10      0.26                         <0.001       35      0.01                         0.931
11      0.20                         <0.001       36      0.00                         0.963
12      0.18                         <0.001       37      0.04                         0.535
13      0.20                         <0.001       38      0.05                         0.473
14      0.19                         <0.001       39      -0.03                        0.642
15      0.19                         <0.001       40      -0.01                        0.926
16      0.20                         <0.001       41      0.02                         0.799
17      0.23                         <0.001       42      0.06                         0.395
18      0.25                         <0.001       43      0.05                         0.482
19      0.24                         <0.001       44      0.06                         0.460
20      0.24                         <0.001       45      0.07                         0.358
21      0.23                         <0.001       46      0.07                         0.385
22      0.23                         <0.001       47      0.04                         0.608
23      0.19                         <0.001       48      0.05                         0.522
24      0.14                         0.006        49      0.02                         0.844
25      0.15                         0.007        50      0.02                         0.846

Appendix B

Language Distributions

Figure B.1: Distribution of compression ratios for Haskell projects.

Figure B.2: Distribution of compression ratios for Erlang projects.

Figure B.3: Distribution of compression ratios for Python projects.

Figure B.4: Distribution of compression ratios for JavaScript projects.

Figure B.5: Distribution of compression ratios for TypeScript projects.

Figure B.6: Distribution of compression ratios for Groovy projects.

Figure B.7: Distribution of compression ratios for Pascal projects.

Figure B.8: Distribution of compression ratios for PHP projects.

Figure B.9: Distribution of compression ratios for Java projects.

Figure B.10: Distribution of compression ratios for SQL projects.

Figure B.11: Distribution of compression ratios for HTML projects.

Figure B.12: Distribution of compression ratios for CSS projects.

Figure B.13: Distribution of compression ratios for Razor projects.

Figure B.14: Distribution of compression ratios for C# projects.

Figure B.15: Distribution of compression ratios for VisualBasic projects.

Figure B.16: Distribution of compression ratios for ASP.NET projects.

Figure B.17: Distribution of compression ratios for VisualBasic projects.

Figure B.18: Distribution of compression ratios for Smalltalk projects.

Figure B.19: Distribution of compression ratios for COBOL projects.

Figure B.20: Distribution of compression ratios for XML projects.

Figure B.21: Distribution of compression ratios for XSLT projects.

Figure B.22: Distribution of compression ratios for XSD projects.
