SGL: a Domain-Specific Language for Large-Scale Analysis of Open-Source

SGL: A domain-specific language for large-scale analysis of open-source code Darius Foo Ang Ming Yi Jason Yeo Asankhaya Sharma SourceClear, Inc. SourceClear, Inc. SourceClear, Inc. SourceClear, Inc. [email protected] [email protected] [email protected] [email protected] Abstract—Today software is built in fundamentally different We believe that an automated way to audit the open-source ways from how it was a decade ago. It is increasingly common ecosystem – by cataloguing existing vulnerabilities and bugs, for applications to be assembled out of open-source components, as well as discovering new ones – is essential to being able resulting in the use of large amounts of third-party code. This third-party code is a means for vulnerabilities to make their to use open-source safely. way downstream into applications. Recent vulnerabilities such To this end, we describe the Security Graph Language as Heartbleed, FREAK SSL/TLS, GHOST, and the Equifax data (SGL), a domain-specific language (DSL) for analyzing graph- breach (due to a flaw in Apache Struts) were ultimately caused structured datasets of open-source code. SGL allows users to by third-party components. We argue that an automated way to express complex queries on relations between libraries and audit the open-source ecosystem, catalog existing vulnerabilities, and discover new flaws is essential to using open-source safely. vulnerabilities: for example, taking the transitive closure of a To this end, we describe the Security Graph Language (SGL), a calls relation in a call graph to discover if method calls from domain-specific language for analyzing graph-structured datasets a library reach a sink known to be associated with a given of open-source code and cataloguing vulnerabilities. SGL allows vulnerability. users to express complex queries on relations between libraries SGL doubles as a vulnerability description language. We and vulnerabilities in the style of a program analysis language. represent vulnerabilities intrinsically, as SGL queries. This SGL queries double as an executable representation for vulnerabilities, allowing vulnerabilities to be automatically checked allows them to be deduplicated based on the structural types against a database and deduplicated using a canonical represen- of queries, providing a means of maintaining the quality of tation. We outline a novel optimization for SGL queries based on vulnerability datasets. This representation also allows vul- regular path query containment, improving query performance nerabilities to be easily verified against a given dataset by by up to 3 orders of magnitude. We also demonstrate the executing them. effectiveness of SGL in practice to find zero-day vulnerabilities by identifying several flaws in the open-source version of Oracle SGL is implemented by compilation to Gremlin, a graph GlassFish Server. traversal language from the Apache TinkerPop [24] project. This allows SGL to use any TinkerPop-based graph database I. INTRODUCTION as a back-end; our own implementation uses the DataStax The adoption of DevOps practices and sophisticated de- Enterprise Graph database. ployment tools have made it straightforward to release and Our main technical contributions are: consume software packages. • The implementation of a DSL capable of both describing Developers assemble large applications using off-the-shelf vulnerabilities and representing complex queries on real- open-source components and libraries, which are distributed world vulnerability datasets. through centralized repositories such as Maven Central, NPM, • A strategy for vulnerability duplication checking based RubyGems, and PyPI. Much of the busywork of downloading on computing structural types for queries, leading us sources and negotiating package versions is automated by to the idea of a canonical, executable representation for dependency management tools like npm, pip, or gem.A vulnerabilities. single install command can pull in hundreds of libraries, • A novel means of optimizing special cases of SGL demonstrating how easily large volumes of third-party code queries for runtime performance based on regular path can be included in software projects. query containment. This can improve query performance There are many benefits to using open-source libraries: low by up to 3 orders of magnitude. cost, code reuse, and the flexibility to customize it to one’s This paper is structured as follows: needs. However, unaudited third-party code is also a means for • Section II covers the problems SGL solves in greater flaws and vulnerabilities to make their way downstream into depth and compares SGL with prior work. applications. Recent vulnerabilities – Heartbleed [6], FREAK • In Section III, we provide an overview of Gremlin. SSL/TLS [8], and GHOST [9] – were due to bugs in popular • In Section IV, we describe the syntax and semantics of open-source libraries. The recent Equifax data breach [12], the SGL. largest in history, was also due to a known vulnerability in the • In Section V, we cover SGL’s type system and how it Apache Struts web framework. supports canonical vulnerability representation. • In Section VI, we describe how SGL queries may be security. We hope to grow the language in these two areas in rewritten and optimized. future. • In Section VII we evaluate the performance of rewritten SGL queries and include a case study on identifying zero- B. Canonical vulnerability representation day vulnerabilities in the open-source version of Oracle SGL is also intended to give vulnerabilities a canonical, GlassFish Server. executable representation. This gives it a secondary use as a • We conclude by discussing areas for future improvement vulnerability description language that works in a distributed in Section VIII. setting. Vulnerabilities are possible only under specific conditions, II. MOTIVATION require specific versions of components, and have varying and relative severity. Without some form of identifier, it is difficult A. Large-scale program analysis and graph qeuries to pinpoint the particular vulnerability one is discussing and SGL is a DSL for program analysis. It is meant to express get it the attention it requires. relations involving open-source libraries, their file contents, Centralized vulnerability databases (e.g. NVD) are the de and associated vulnerabilities. Examples of domain objects are facto means of cataloging flaws and bugs in software compo- files and file hashes (library-level), as well as methods, classes, nents. They assign nominal identifiers (e.g. CVE-2017-1234) and bytecode hashes (source-level). The full schema is shown to instances of vulnerabilities, giving researchers a way to refer in Figure 3. to them. Source-level granularity is required because vulnerabilities The CVE format is not completely machine-readable. Com- ultimately originate in programs, and an understanding of ponents that a CVE pertains to are identified inconsistently, program semantics is required to accurately analyze their leading to false positives when trying to assess if a CVE is effects. For example, to determine if a software project inherits relevant to some real-world system. a vulnerability from an open-source library it uses, one might DSLs meant to add structure to CVEs exist; OVAL [3] is check if it calls a vulnerable method (i.e. a known sink perhaps the most widely-adopted standard for vulnerability identified by a vulnerability research team for that particular description. It uses information derived from CVEs and CPEs vulnerability). This entails building a call graph [1][2], pos- to identify vulnerable system configurations. Other efforts in sibly also performing a data-flow analysis. The purpose of the space are OASIS AVDL [4], for describing web application doing this at scale is to discover zero-day vulnerabilities from exploits, and VuXML [5], which captures vulnerabilities in inter-library relationships. software packages for FreeBSD. A useful representation of an entire open-source ecosystem, As a vulnerability description language, SGL is focused on from its libraries to their methods, is likely too large to fit the domain of open-source libraries and vulnerabilities, and in memory. Our Java dataset contains data for 79M vertices on drawing insights from large datasets offline, instead of and 582M edges, computed from 1.4M libraries and totalling live systems. It is also generally terser (allowing interactive about 76GB. This is why SGL is at heart a graph query use; see Section II-C) and contains fewer auxiliary details language, intended to interact with databases of library code (such as a component’s vendor), but has enough to identify and vulnerabilities. software components unambiguously. Where it does better As a general-purpose graph query language, SGL’s closest than its competitors is in its support for automated verification relative is Gremlin [24], from which it inherits its core seman- and deduplication. tics. As a declarative fragment of Gremlin, it is able to express Every SGL description of a vulnerability is a graph query, path queries without joins, but is not Turing-complete. Like which can be trivially tested by executing it against a database. Gremlin, it operates at a lower level of abstraction than other This makes the ability to gather the data the only requirement graph query languages, like Cypher [14] (Neo4j), Datomic for being able to gain insight from an SGL-described vulner- Datalog [10] (Datomic), and GraQL [19] (Grakn), and has ability. limited facilities for aggregation. Furthermore,

SGL: a Domain-Specific Language for Large-Scale Analysis of Open-Source

Empirical Study on the Usage of Graph Query Languages in Open Source Java Projects

Use Splunk with Big Data Repositories Like Spark, Solr, Hadoop and Nosql Storage

Personal Information Backgrounds Abilities Highlights

Hortonworks Cybersecurity Platform Administration (April 24, 2018)

Towards an Integrated Graph Algebra for Graph Pattern Matching with Gremlin

My Steps to Learn About Apache Nifi

Licensing Information User Manual

Large Scale Querying and Processing for Property Graphs Phd Symposium∗

Handling Data Flows of Streaming Internet of Things Data

Introduction to Graph Database with Cypher & Neo4j

The Query Translation Landscape: a Survey

Full-Graph-Limited-Mvn-Deps.Pdf