US 20190236273A1

(19) United States
(12) Patent Application Publication
     SAXE et al.
(10) Pub. No.: US 2019/0236273 A1
(43) Pub. Date: Aug. 1, 2019

(54) METHODS AND APPARATUS FOR DETECTION OF MALICIOUS DOCUMENTS USING MACHINE LEARNING

(71) Applicant: Sophos Limited, Abingdon (GB)

(72) Inventors: Joshua Daniel SAXE, Los Angeles, CA (US); Ethan M. RUDD, Colorado Springs, CO (US); Richard HARANG, Alexandria, VA (US)

(73) Assignee: Sophos Limited, Abingdon (GB)

(21) Appl. No.: 16/257,749

(22) Filed: Jan. 25, 2019

Related U.S. Application Data

(60) Provisional application No. 62/622,440, filed on Jan. 26, 2018.

Publication Classification

(51) Int. Cl.
     G06F 21/56 (2006.01)
     G06N 20/20 (2006.01)
     G06N 3/04 (2006.01)
     G06K 9/62 (2006.01)

(52) U.S. Cl.
     CPC .. G06F 21/563 (2013.01); G06N 20/20 (2019.01); G06K 9/6256 (2013.01); G06K 9/6267 (2013.01); G06N 3/04 (2013.01)

(57) ABSTRACT

An apparatus for detecting malicious files includes a memory and a processor communicatively coupled to the memory. The processor receives multiple potentially malicious files. A first potentially malicious file has a first file format, and a second potentially malicious file has a second file format different than the first file format. The processor extracts a first set of strings from the first potentially malicious file, and extracts a second set of strings from the second potentially malicious file. First and second feature vectors are defined based on lengths of each string from the associated set of strings. The processor provides the first feature vector as an input to a machine learning model to produce a maliciousness classification of the first potentially malicious file, and provides the second feature vector as an input to the machine learning model to produce a maliciousness classification of the second potentially malicious file.
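The abstract's pipeline (extract strings from a file, define a fixed-length feature vector from the string lengths, feed that vector to one model regardless of file format) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the publication: the printable-ASCII string rule, the minimum length of 4, the log2 length bins, the CRC32 hash buckets, the 8 x 8 histogram size, and the function names `extract_strings`, `length_hash_histogram`, and `feature_vector`. The hash dimension loosely follows the "string length-hash histogram" label in FIG. 1B.

```python
import math
import re
import zlib


def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII at least min_len bytes long (assumed rule)."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return pattern.findall(data)


def length_hash_histogram(strings, n_len_bins: int = 8, n_hash_bins: int = 8) -> list:
    """Flattened 2-D histogram over (log2 of string length, hash bucket).

    The bin counts and the CRC32 hash are illustrative choices only.
    """
    hist = [[0.0] * n_hash_bins for _ in range(n_len_bins)]
    for s in strings:
        if not s:
            continue  # log2 is undefined for empty strings
        len_bin = min(int(math.log2(len(s))), n_len_bins - 1)
        hash_bin = zlib.crc32(s) % n_hash_bins
        hist[len_bin][hash_bin] += 1.0
    # Normalize so files with different numbers of strings are comparable.
    total = sum(sum(row) for row in hist) or 1.0
    return [count / total for row in hist for count in row]


def feature_vector(data: bytes) -> list:
    """Fixed-length float vector, independent of the input file format."""
    return length_hash_histogram(extract_strings(data))


# Two toy inputs standing in for files of different formats (hypothetical bytes).
pdf_like = b"%PDF-1.7\x00\x01/JavaScript (app.alert)\x00\x02/OpenAction"
ooxml_like = b"PK\x03\x04\x00word/document.xml\x00\x01word/vbaProject.bin"

vec_a = feature_vector(pdf_like)
vec_b = feature_vector(ooxml_like)
print(len(vec_a), len(vec_b))
```

Because both vectors have the same length, either could be passed to the same classifier, which is the property the abstract relies on when it feeds files of different formats to a single machine learning model.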
Patent Application Publication    Aug. 1, 2019    Sheet 1 of 13    US 2019/0236273 A1

[FIG. 1A (Sheet 1 of 13): block diagram of system 100A showing a processor 110 and data source(s); other labels illegible.]

[FIG. 1B (Sheet 2 of 13; also reproduced on the front page): flowchart 100B. Collect documents (130); File type? (131), with branches "Zip file" and "Office document"; Dump raw bytes from central directory (132); Extract features (e.g., string length-hash histogram(s), N-gram histogram(s), byte entropy histogram(s), byte mean-standard deviation histogram(s)) from collected documents (133); More files to analyze? (Yes/No); Convert documents into fixed-length floating point feature vectors (134); Concatenate all feature vectors (e.g., to generate a Receiver Operating Characteristics (ROC) curve) into a concatenated vector (136); Use concatenated vector to train classifier (e.g., DNN or XGB) (138).]

[FIG. 2 (Sheet 3 of 13): archive layout showing File 1 Header, File 2 Header, ..., File N Header, Central Directory Structure, and End of Directory.]

[FIG. 3 (Sheet 4 of 13): chart titled "SM Dataset Breakdown by Document Type". Categories: MS Word Document, MS Excel Spreadsheet, MS Powerpoint Presentation, Office Open XML Document, Office Open XML Spreadsheet, Office Open XML Presentation. Legible values: 2725929, 610250.]

[FIG. 4 (Sheet 5 of 13): chart; legible labels include xgbvt, xgbcc, and nnCC.]

[FIG. 5 (Sheet 6 of 13): chart titled "Common Crawl Breakdown by Document Type". Categories: MS Word Document, MS Excel Spreadsheet, MS Powerpoint Presentation, Office Open XML Document, Office Open XML Spreadsheet, Office Open XML Presentation. Legible value: 48383.]

[FIG. 6 (Sheet 7 of 13): plot with logarithmic axis ticks (e.g., 10^-5, 10^-2); legend includes xgbconcat_deep; remaining labels illegible.]

[FIG. 7A (Sheet 8 of 13): no text recoverable.]

[FIG. 7B (Sheet 9 of 13): plot with an axis labeled "Hash Index" (ticks 0 to 60); reference numerals 702 and 704.]

[FIG. 8 (Sheet 10 of 13): flowchart with two parallel branches feeding Machine Learning Model (MLM) (810). First branch: Receive 1st potentially malicious file (802A); Extract 1st set of strings from 1st potentially malicious file (804A); Define 1st feature vector, based on string lengths of 1st set of strings (806A); Provide 1st feature vector to machine learning model (808A). Second branch: Receive 2nd potentially malicious file (802B); Extract 2nd set of strings from 2nd potentially malicious file (804B); Define 2nd feature vector, based on string lengths of 2nd set of strings (806B); Provide 2nd feature vector to machine learning model (808B).]

[FIG. 9 (Sheet 11 of 13): flowchart. Receive potentially malicious file with archive format (920); Identify central directory structure of potentially malicious file (922); Extract set of strings from central directory structure (924); Define feature vector based on set of strings (926); Provide feature vector to machine learning model for maliciousness classification (928).]
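The first three steps listed for FIG. 9 (receive an archive-format file, identify its central directory structure, extract a set of strings from it) can be sketched in Python as follows. This is a sketch under stated assumptions, not the applicants' implementation: it handles only plain single-disk ZIP archives without ZIP64 extensions, locates the central directory through the end-of-central-directory record defined by the ZIP format, and reuses an assumed printable-ASCII string rule. The function names `central_directory_bytes` and `extract_strings` and the sample file names are hypothetical.

```python
import io
import re
import struct
import zipfile

# End-of-central-directory (EOCD) record: signature, four 16-bit counters,
# central directory size, central directory offset, comment length.
EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FORMAT = "<4sHHHHIIH"


def central_directory_bytes(data: bytes) -> bytes:
    """Return the raw central directory of a simple (non-ZIP64) archive."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("end-of-central-directory record not found")
    fields = struct.unpack_from(EOCD_FORMAT, data, pos)
    cd_size, cd_offset = fields[5], fields[6]
    return data[cd_offset:cd_offset + cd_size]


def extract_strings(raw: bytes, min_len: int = 4) -> list:
    """Runs of printable ASCII at least min_len bytes long (assumed rule)."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return pattern.findall(raw)


# Build a small in-memory archive to stand in for a received file (step 920).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/vbaProject.bin", b"\x00\x01\x02")
archive_bytes = buffer.getvalue()

cd = central_directory_bytes(archive_bytes)      # step 922
cd_strings = extract_strings(cd)                 # step 924
string_lengths = [len(s) for s in cd_strings]    # raw material for step 926
print(cd_strings, string_lengths)
```

The strings recovered this way are dominated by member file names, so a feature vector built from them describes the archive's listing rather than the contents of its members.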
[FIG. 10 (Sheet 12 of 13): flowchart. Train Machine Learning Model, Using Sets of Strings, to Produce Maliciousness Classifications for Multiple File Formats (1030); Define First Feature Vector Based on Length of a Set of Strings of a First Potentially Malicious File (1032); Identify Maliciousness Classification of First Potentially Malicious File (FF #1) Using Machine Learning Model (1034); Define Second Feature Vector Based on Length of a Set of Strings of a Second Potentially Malicious File (1036); Identify Maliciousness Classification of First Potentially Malicious File (FF #2) Using Machine Learning Model (1038).]

[FIG. 11 (Sheet 13 of 13): chart titled "Archive Lengths by Type". Categories: RAR, Mozilla Firefox Extension, Google Chrome Extension, ZIP, JAR, TAR, GZIP.]

METHODS AND APPARATUS FOR DETECTION OF MALICIOUS DOCUMENTS USING MACHINE LEARNING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 62/622,440, filed Jan. 26, 2018 and titled "Methods and Apparatus for Detection of Malicious Documents Using Machine Learning," the content of which is incorporated herein by reference in its entirety.

BACKGROUND

[0002] Some known machine learning tools can be used to ...

... produce a maliciousness classification for files having a first file format and files having a second file format different from the first file format. The first set of strings can be from a file having the first file format and the second set of strings can be from a file having the second file format. The method also includes defining a first feature vector based on a length of a set of strings within a first potentially malicious file having the first file format, and providing the first feature vector to the machine learning model to identify a maliciousness classification of the first potentially malicious file. The method also includes defining a second feature vector based on a length of a set of strings within a second potentially malicious file having the second file format, and providing the second feature vector to the machine learning model to identify a maliciousness classification of the second potentially malicious