An Empirical Cybersecurity Evaluation of Github Copilot's Code Contributions

An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri Abstract—There is burgeoning interest in designing AI-based model currently available, it is important to understand: Are systems to assist humans in designing computing systems, Copilot’s suggestions commonly insecure? What is the including tools that automatically generate computer code. prevalence of insecure generated code? What factors of the The most notable of these comes in the form of the first self- described ‘AI pair programmer’, GitHub Copilot, a language “context” yield generated code that is more or less secure? model trained over open-source GitHub code. However, code We systematically experiment with Copilot to gain insights often contains bugs—and so, given the vast quantity of unvetted into these questions by designing scenarios for Copilot to code that Copilot has processed, it is certain that the language complete and by analyzing the produced code for security model will have learned from exploitable, buggy code. This weaknesses. As our corpus of well-defined weaknesses, we raises concerns on the security of Copilot’s code contributions. In this work, we systematically investigate the prevalence and check Copilot completions for a subset of MITRE’s “2021 conditions that can cause GitHub Copilot to recommend insecure CWE Top 25 Most Dangerous Software Weaknesses” [4], code. To perform this analysis we prompt Copilot to generate a list that is updated yearly to indicate the most dangerous code in scenarios relevant to high-risk CWEs (e.g. those from software weaknesses as measured over the previous two MITRE’s “Top 25” list). We explore Copilot’s performance on calendar years. The AI’s documentation recommends that three distinct code generation axes—examining how it performs given diversity of weaknesses, diversity of prompts, and diversity one uses “Copilot together with testing practices and security of domains. In total, we produce 89 different scenarios for tools, as well as your own judgment”. Our work attempts Copilot to complete, producing 1,692 programs. Of these, we to characterize the tendency of Copilot to produce insecure found approximately 40 % to be vulnerable. code, giving a gauge for the amount of scrutiny a human Index Terms—Cybersecurity, AI, code generation, CWE developer might need to do for security issues. We study Copilot’s behavior along three dimensions: (1) I. INTRODUCTION diversity of weakness, its propensity for generating code that With increasing pressure on software developers to produce is susceptible to each of weaknesses in the CWE top 25, given code quickly, there is considerable interest in tools and a scenario where such a vulnerability is possible; (2) diversity techniques for improving productivity. The most recent of prompt, its response to the context for a particular scenario entrant into this field is machine learning (ML)-based (SQL injection), and (3) diversity of domain, its response to code generation, in which large models originally designed the domain, i.e., programming language/paradigm. for natural language processing (NLP) are trained on vast For diversity of weakness, we construct three different quantities of code and attempt to provide sensible completions scenarios for each applicable top-25 CWE and use CodeQL [5] as programmers write code. In June 2021, GitHub released along with manual inspection to assess whether the Copilot [1], an “AI pair programmer” that generates code in suggestions returned are vulnerable to that CWE. Our goal a variety of languages given some context such as comments, here is to get a broad overview of the types of vulnerability function names, and surrounding code. Copilot is built on a Copilot is most likely to generate, and how often users might large language model that is trained on open-source code [2] encounter such insecure suggestions. Next, we investigate including “public code...with insecure coding patterns”, thus the effect different prompts have on how likely Copilot is to giving rise to the potential for “synthesize[d] code that arXiv:2108.09293v2 [cs.CR] 23 Aug 2021 return suggestions that are vulnerable to SQL injection. This contains these undesirable patterns” [1]. investigation allows us to better understand what patterns Although prior research has evaluated the functionality of programmers may wish to avoid when using Copilot, or ways code generated by language models [3], [2], there is no to help guide it to produce more secure code. systematic examination of the security of ML-generated code. Finally, we study the security of code generated by Copilot As GitHub Copilot is the largest and most capable such when it is used for a domain that was less frequently seen H. Pearce is with the Dept. of Electrical and Computer Engineering, New in its training data. Copilot’s marketing materials claim that York University, NY 11201 USA (e-mail: [email protected]) it speaks “all the languages one loves.” To test this claim, we B. Ahmad is with the Dept. of Electrical and Computer Engineering, New focus on Copilot’s behavior when tasked with a new domain York University, NY 11201 USA (e-mail: [email protected]) B. Tan is is with the Dept. of Electrical and Computer Engineering, New added to the MITRE CWEs in 2020—hardware-specific York University, NY 11201 USA (e-mail: [email protected]) CWEs [6]. As with the software CWEs, hardware designers B. Dolan-Gavitt is with the Dept. of Computer Science and Engineering, can be sure that their designs meet a certain baseline level New York University, NY 11201 USA (e-mail: [email protected]) R. Karri is with the Dept. of Electrical and Computer Engineering, New of security if their designs are free of hardware weaknesses. York University, NY 11201 USA (e-mail: [email protected]) We are interested in studying how Copilot performs when tasked with generating register-transfer level (RTL) code in was limited to one type of weakness (insecure crypto the hardware description language Verilog. parameters, namely short RSA key sizes and using AES in Our contributions include the following. We perform ECB mode). The authors note that “a larger study using the automatic and manual analysis of Copilot’s software and most common insecure code vulnerabilities” is needed, and hardware code completion behavior in response to “prompts” we supply such an analysis here. handcrafted to represent security-relevant scenarios and An important feature that Codex and Copilot inherit from characterize the impact that patterns in the context can have GPT-3 is that, given a prompt, they generate the most likely on the AI’s code generation and confidence. We discuss completion for that prompt based on what was seen during implications for software and hardware designers, especially training. In the context of code generation, this means that security novices, when using AI pair programming tools. the model will not necessarily generate the best code (by This work is accompanied by the release of our repository of whatever metric you choose—performance, security, etc.) but security-relevant scenarios (see the Appendix). rather the one that best matches the code that came before. As a result, the quality of the generated code can be strongly II. BACKGROUND AND RELATED WORK influenced by semantically irrelevant features of the prompt. A. Code Generation We explore the effect of different prompts in Section V-C. Software development involves the iterative refinement of a (plain language) specification into a software B. Evaluating Code Security implementation—developers write code, comments, and Numerous elements determine the quality of code. Code other supporting collateral as they work towards a functional generation literature emphasizes functional correctness, product. Early work proposed ML-based tools to support measured by compilation and checking against unit tests, developers through all stages of the software design life-cycle or using text similarity metrics to desired responses [2]. (e.g., predicting designer effort, extracting specifications [7]). Unlike metrics for functional correctness of generated code, With recent advancements in the domain of deep learning (DL) evaluating the security of code contributions made by Copilot and NLP, sophisticated models can perform sophisticated is an open problem. Aside from manual assessment by a interventions on a code base, such as automated program human security expert there are myriad tools and techniques repair [8]. In this work, we focus on Copilot as an “AI to perform security analyses of software [20]. Source code pair programmer” that offers a designer code completion analysis tools, known as Static Application Security Testing suggestions in “real-time” as they write code in a text editor. (SAST) Tools, are designed to analyze source code and/or There are many efforts to automatically translate compiled versions of code to find security flaws; typically specifications into computer code for natural language they specialize on identifying a specific vulnerability class. programming [9], through formal models for automatic code In this work, we gauge the security of Copilot’s contri- generation (e.g., [10], [11]) or via machine-learned NLP butions using a mix of automated analysis using GitHub’s [12]. DL architectures that demonstrate good fits for NLP CodeQL tool [5] (as it can scan for a wider range of security include LSTMs [13], RNNs [14], and Transformers [15] that weaknesses in code compared to other tools) alongside our have paved the way for models such as BERT [16], GPT- manual code inspection. CodeQL is open-source and supports 2 [17], and GPT-3 [18]. These models can perform language the analysis of software in languages such as Java, JavaScript, tasks such as translation and answering questions from the C++, C#, and Python. Through queries written in its QL query CoQA [19] dataset; after fine-tuning on specialized datasets, language, CodeQL can find issues in codebases based on a set the models can undertake tasks such as code completion [2].

Load more