How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories Michael Meli Matthew R. McNiece Bradley Reaves North Carolina State University North Carolina State University North Carolina State University
[email protected] Cisco Systems, Inc.
[email protected] [email protected] Abstract—GitHub and similar platforms have made public leaked in this way have been exploited before [4], [8], [21], [25], collaborative development of software commonplace. However, a [41], [46]. While this problem is known, it remains unknown to problem arises when this public code must manage authentication what extent secrets are leaked and how attackers can efficiently secrets, such as API keys or cryptographic secrets. These secrets and effectively extract these secrets. must be kept private for security, yet common development practices like adding these secrets to code make accidental leakage In this paper, we present the first comprehensive, longi- frequent. In this paper, we present the first large-scale and tudinal analysis of secret leakage on GitHub. We build and longitudinal analysis of secret leakage on GitHub. We examine evaluate two different approaches for mining secrets: one is able billions of files collected using two complementary approaches: a to discover 99% of newly committed files containing secrets in nearly six-month scan of real-time public GitHub commits and a public snapshot covering 13% of open-source repositories. We real time, while the other leverages a large snapshot covering focus on private key files and 11 high-impact platforms with 13% of all public repositories, some dating to GitHub’s creation. distinctive API key formats. This focus allows us to develop We examine millions of repositories and billions of files to conservative detection techniques that we manually and automat- recover hundreds of thousands of secrets targeting 11 different ically evaluate to ensure accurate results.