RTFM! Automatic Assumption Discovery and Verification Derivation from Library Document for API Misuse Detection

Session 6B: Exploitation and Defenses. CCS '20, November 9–13, 2020, Virtual Event, USA

Tao Lv1,2, Ruishi Li1,2, Yi Yang1,2, Kai Chen1,2,∗, Xiaojing Liao3, XiaoFeng Wang3, Peiwei Hu1,2, Luyi Xing3
1 SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences, China
2 School of Cyber Security, University of Chinese Academy of Sciences, China
3 Luddy School of Informatics, Computing and Engineering, Indiana University Bloomington
{lvtao,liruishi,yangyi,chenkai,hupeiwei}@iie.ac.cn, {xliao,xw7,luyixing}@indiana.edu
∗Corresponding author.

Abstract
To use library APIs, a developer is supposed to follow guidance and respect some constraints, which we call integration assumptions (IAs). Violations of these assumptions can have serious consequences, introducing security-critical flaws such as use-after-free, NULL-dereference, and authentication errors. Analyzing a program for compliance with IAs involves significant effort and needs to be automated. A promising direction is to automatically recover IAs from a library document using Natural Language Processing (NLP) and then verify their consistency with the ways APIs are used in a program through code analysis. However, a practical solution along this line needs to overcome several key challenges, particularly the discovery of IAs from loosely formatted documents and interpretation of their informal descriptions to identify complicated constraints (e.g., data-/control-flow relations between different APIs). In this paper, we present a new technique for automated assumption discovery and verification derivation from library documents. Our approach, called Advance, utilizes a suite of innovations to address those challenges. More specifically, we leverage the observation that IAs tend to express a strong sentiment in emphasizing the importance of a constraint, particularly security-critical ones, and utilize a new sentiment analysis model to accurately recover them from loosely formatted documents. These IAs are further processed to identify hidden references to APIs and parameters, through an embedding model, to identify the information-flow relations expected to be followed. Then our approach runs frequent subtree mining to discover the grammatical units in IA sentences that tend to indicate some categories of constraints that could have security implications. These components are mapped to verification code snippets organized in line with the IA sentence’s grammatical structure, and can be assembled into verification code executed through CodeQL to discover misuses inside a program. We implemented this design and evaluated it on 5 popular libraries (OpenSSL, SQLite, libpcap, libdbus and libxml2) and 39 real-world applications. Our analysis discovered 193 API misuses, including 139 flaws never reported before.

CCS Concepts
• Security and privacy → Vulnerability scanners; • Software and its engineering → Software safety.

Keywords
API misuse; Integration assumption; Verification code generation; Documentation analysis

ACM Reference Format:
Tao Lv, Ruishi Li, Yi Yang, Kai Chen, Xiaojing Liao, XiaoFeng Wang, Peiwei Hu, Luyi Xing. 2020. RTFM! Automatic Assumption Discovery and Verification Derivation from Library Document for API Misuse Detection. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS ’20). ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3372297.3423360

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7089-9/20/11...$15.00

1 Introduction
Today’s software is more composed than built, with existing program components, even online services, being extensively reused, often through integration of their Application Programming Interfaces (APIs). These APIs may only operate as expected under some constraints (referred to as integration assumptions or IAs in our research) on their inputs (called pre-conditions, like a limit on parameter lengths), outputs (called post-conditions, e.g., “return 1 or 0”) and invocation context (called context conditions, such as “must be called from the same thread”), which their users should be aware of. Although these IAs are typically elaborated in library documentation, they may not be strictly followed by software developers, as reported by prior studies [42, 44]. This affects the functionality of the software using these APIs, and often comes with serious security or privacy implications. As an example, setuid, which temporarily carries a high privilege, is expected to drop its privilege when its task is done, and failing to do so needs to be checked, as indicated in the libstdc documentation: “it is a grave security error to omit to check for a failure return from setuid()”. This instruction, however, was not respected by PulseAudio 0.9.8, leading to a local privilege escalation risk (CVE-2008-0008).

The fundamental cause of API misuse is that developers do not follow the IAs of APIs stated in documents, which therefore become the key source for API misuse discovery. Prior research on API misuse detection and API specification discovery [21, 30, 37, 41, 52] relies on analyzing a set of code to infer putative IAs, so as to identify API misuses. However, these IAs are often less accurate, introducing significant false positives. In the meantime, the limited coverage of the code set can also cause many real-world misuse cases to fall through the cracks (Section 5.3).

Finding misuse from doc: challenges. Alternatively, one can recover the APIs' IAs from documentation for a compliance check on the software integrating them, which leverages more semantic information and therefore can be more accurate. Given the size of today’s library documents (e.g., 327 pages for OpenSSL), manual inspection becomes less realistic. Prior research seeks to automate this process by inferring these assumptions using natural language processing (NLP) [42, 44, 47]. Particularly, it has been shown that for well-formatted C# documents, a set of templates that specify the relations between subject and object is adequate for recovering IAs. This simple approach, however, no longer works well on less organized documents, like those for most C libraries, such as OpenSSL, where IA descriptions can be hard to fit into any fixed templates: as an example, we ran the code released by the prior research on OpenSSL and observed an accuracy of only 49% (Section 5.3). How to effectively identify the API constraints from such documents turns out to be nontrivial.

Another challenge is to automatically analyze each IA and determine a program’s compliance with the assumption. The prior research maps a set of predicates on pre- and post-conditions to formal expressions (e.g., “greater” to “>”) [44]. Less known is how these predicates are selected and whether they are representative. Most concerning is the lack of semantic analysis on context conditions (except on API order [42, 54]), which can describe the [...]

[...] library documentation to recover and interpret IAs and then translating them into verification code executed by CodeQL [4] to find the API misuses in a program’s API integration. For this purpose, Advance is designed to overcome the technical barriers mentioned earlier, based upon several key observations. More specifically, we found that despite the diversity in the ways IAs are described, the related sentences are characterized by a strong sentiment to emphasize the constraints that API users are expected to follow: e.g., “the application must finalize every prepared statement”; “make sure that you explicitly check for PCAP_ERROR”; “it is the job of the callback to store information about the state of the last call”. Such sentiments cannot be easily described by templates but can be captured using semantic analysis. In our research, we utilized a Hierarchical Attention Network (HAN) [51] to capture the sentences carrying such a sentiment when the entities (API names, parameters) under the constraint (as expressed by the sentiment) can be controlled by the API caller (Section 3.2). This has been done by training a classifier named S-HAN on 5,186 annotated data items, which is shown to work effectively in inferring IA sentences from library documents.

Also discovered in our research is the pervasiveness of similar IA components within a document and, in some cases, across different libraries. These components imply common assumption (constraint) types for certain kinds of operations, which are often among the most important and can be security critical. For example, “... is not threadsafe”, an expression describing thread safety; “... must be freed by...”, an IA asking for resource release – an operation with a significant security [...]
