
Old Dominion University
ODU Digital Commons
Computer Science Faculty Publications, Computer Science, 2019

Sec-Lib: Protecting Scholarly Digital Libraries From Infected Papers Using Active Machine Learning Framework

Authors: Nir Nissim, Aviad Cohen, Jian Wu (Old Dominion University), Andrea Lanzi, Lior Rokach, Yuval Elovici, and Lee Giles

Part of the Computer Engineering Commons, and the Information Security Commons

Original Publication Citation
Nissim, N., Cohen, A., Wu, J., Lanzi, A., Rokach, L., Elovici, Y., & Giles, L. (2019). Sec-Lib: Protecting scholarly digital libraries from infected papers using active machine learning framework. IEEE Access, 7, 110050-110073. doi:10.1109/access.2019.2933197

This Article is brought to you for free and open access by the Computer Science at ODU Digital Commons. It has been accepted for inclusion in Computer Science Faculty Publications by an authorized administrator of ODU Digital Commons. This article is available at ODU Digital Commons: https://digitalcommons.odu.edu/computerscience_fac_pubs/141

IEEE Access: Multidisciplinary, Rapid Review, Open Access Journal
Received July 8, 2019, accepted July 24, 2019, date of publication August 6, 2019, date of current version August 21, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2933197

Sec-Lib: Protecting Scholarly Digital Libraries From Infected Papers Using Active Machine Learning Framework

NIR NISSIM (1,2), AVIAD COHEN (1,3), JIAN WU (4), ANDREA LANZI (5), LIOR ROKACH (1,3), YUVAL ELOVICI (1,3), AND LEE GILES (6)

(1) Malware Lab, Cyber Security Research Center (CSRC), Ben-Gurion University, Beersheba 84105, Israel
(2) Department of Industrial Engineering and Management, Ben-Gurion University, Beersheba 84105, Israel
(3) Department of Software and Information Systems Engineering, Ben-Gurion University, Beersheba 84105, Israel
(4) Computer Science Department, Old Dominion University, Norfolk, VA 23529, USA
(5) Computer Science Department, University of Milan, 20122 Milan, Italy
(6) Computer Science and Engineering Department, Pennsylvania State University, State College, PA 16801, USA

Corresponding author: Nir Nissim ([email protected])

ABSTRACT Researchers from academia and the corporate sector rely on scholarly digital libraries to access articles. Attackers take advantage of innocent users who consider the articles' files safe and thus open PDF files with little concern. In addition, researchers consider scholarly libraries a reliable, trusted, and untainted corpus of papers. For these reasons, scholarly digital libraries are an attractive target and inadvertently support the proliferation of cyber-attacks launched via malicious PDF files. In this study, we present related vulnerabilities and malware distribution approaches that exploit the vulnerabilities of scholarly digital libraries. We evaluated over two million scholarly papers in the CiteSeerX library and found the library to be contaminated with a surprisingly large number (0.3-2%) of malicious PDF documents (over 55% were crawled from the IPs of US universities). We developed Sec-Lib, a two-layered detection framework aimed at enhancing the detection of malicious PDF documents, which offers a security solution for large digital libraries.
Sec-Lib includes a deterministic layer for detecting known malware, and a machine learning based layer for detecting unknown malware. Our evaluation showed that scholarly digital libraries can detect 96.9% of malware with Sec-Lib, while minimizing the number of PDF files requiring labeling, and thus reducing the manual inspection efforts of security experts by 98%.

INDEX TERMS Scholarly, digital, library, paper, PDF documents, malware, malicious documents, distribution.

The associate editor coordinating the review of this manuscript and approving it for publication was Luis Javier Garcia Villalba.

I. INTRODUCTION
The number of scholarly documents (English language) accessible on the Web is enormous, estimated at over 114 million PDF documents [5], of which over 27 million (∼24%) can be easily accessed without payment or subscription [5]; since then, the estimated number of scholarly documents on the Web has risen significantly. These documents are freely available in part because researchers publish draft versions of their papers on their professional home pages (often within the domains of universities), before the final versions are published by the publishers.

Researchers also publish their research on their home pages to increase exposure, reach researchers around the world, and gain citations and recognition for their work [6], [7]. In order to assist researchers, many scholarly digital libraries and search engines collect and index the author's version. Thus, the papers can be easily downloaded worldwide. This free collection of scholarly documents is a valuable resource for most researchers and academics who may not have a comprehensive subscription to all publishers' content.

Figure 1 presents a snapshot of search results for a searched paper using Google Scholar. At the bottom of the page, one can access all 15 versions of the paper, already
indexed by Google Scholar, simply by clicking on the blue

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

FIGURE 1. Google Scholar's search results for a given academic paper, including 14 additional versions of the paper. (The result shown is "Detection of malicious PDF files based on hierarchical document structure," N. Srndic and P. Laskov, 2013, Citeseer; cited by 102.)

[Figure: list of versions of the same paper indexed by Google Scholar, with PDF copies hosted at psu.edu, uni-tuebingen.de, semanticscholar.org, and internetsociety.org.]

even scan them to detect malicious content. In addition, their reputation as sources of trusted scholarly documents makes digital libraries an attractive platform from which to take advantage of and distribute malicious PDF documents. Attackers are aware of this chain of trust and use social engineering techniques in which they take advantage of the heavy use and blind trust of researchers in scholarly digital libraries and the papers (PDF documents) they download from them; once one researcher within an organization is infected, it can quickly become a major cyber security incident for the entire organization's computational system [32]. Researchers' Web pages have become a target that can be used to launch attacks. In addition, researchers, professors, and research students are naturally attractive candidates for attack, because, due to the nature of their work, they have access to confidential and sensitive information, such as nuclear knowledge, medical records, aviation, and educational records and materials (e.g., student data, exams, etc.). Moreover, some researchers collaborate with governmental agencies and industry, which allows them access to national and confidential information