US 2019/0236273 A1

(19) United States
(12) Patent Application Publication
     SAXE et al.
(10) Pub. No.: US 2019/0236273 A1
(43) Pub. Date: Aug. 1, 2019

(54) METHODS AND APPARATUS FOR DETECTION OF MALICIOUS DOCUMENTS USING MACHINE LEARNING

(71) Applicant: Sophos Limited, Abingdon (GB)

(72) Inventors: Joshua Daniel SAXE, Los Angeles, CA (US); Ethan M. RUDD, Colorado Springs, CO (US); Richard HARANG, Alexandria, VA (US)

(73) Assignee: Sophos Limited, Abingdon (GB)

(21) Appl. No.: 16/257,749

(22) Filed: Jan. 25, 2019

Related U.S. Application Data

(60) Provisional application No. 62/622,440, filed on Jan. 26, 2018.

Publication Classification

(51) Int. Cl.
     G06F 21/56 (2006.01)
     G06N 20/20 (2006.01)
     G06N 3/04 (2006.01)
     G06K 9/62 (2006.01)

(52) U.S. Cl.
     CPC .. G06F 21/563 (2013.01); G06N 20/20 (2019.01); G06K 9/6256 (2013.01); G06K 9/6267 (2013.01); G06N 3/04 (2013.01)

(57) ABSTRACT

An apparatus for detecting malicious files includes a memory and a processor communicatively coupled to the memory. The processor receives multiple potentially malicious files. A first potentially malicious file has a first file format, and a second potentially malicious file has a second file format different than the first file format. The processor extracts a first set of strings from the first potentially malicious file, and extracts a second set of strings from the second potentially malicious file. First and second feature vectors are defined based on lengths of each string from the associated set of strings. The processor provides the first feature vector as an input to a machine learning model to produce a maliciousness classification of the first potentially malicious file, and provides the second feature vector as an input to the machine learning model to produce a maliciousness classification of the second potentially malicious file.
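The abstract describes defining feature vectors from the lengths of strings extracted from files of different formats, and the drawings mention a string length-hash histogram. The following is a minimal sketch of one way such a fixed-length vector could be computed. It is an illustration only, not the method as claimed: the printable-ASCII definition of a string, the minimum length `MIN_LEN`, the bin counts `HASH_BINS` and `LEN_BINS`, the use of CRC32 as the hash, and the function names are all assumptions that do not appear in the publication.

```python
import math
import re
import zlib

# Assumed parameters for illustration; the publication does not give these values.
MIN_LEN = 5      # shortest run of printable bytes treated as a string
HASH_BINS = 16   # number of hash buckets (histogram columns)
LEN_BINS = 8     # number of log2-length buckets (histogram rows)


def extract_strings(data: bytes, min_len: int = MIN_LEN) -> list:
    """Return runs of printable ASCII bytes that are at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return re.findall(pattern, data)


def length_hash_histogram(strings, hash_bins=HASH_BINS, len_bins=LEN_BINS) -> list:
    """Build a flattened, normalized 2D histogram over (log2 length, hash bucket)."""
    hist = [0.0] * (hash_bins * len_bins)
    for s in strings:
        if not s:
            continue
        col = zlib.crc32(s) % hash_bins                    # hash bucket
        row = min(int(math.log2(len(s))), len_bins - 1)    # length bucket, clipped
        hist[row * hash_bins + col] += 1.0
    total = sum(hist)
    # Normalize so that files with different string counts are comparable.
    return [v / total for v in hist] if total else hist


def feature_vector(data: bytes) -> list:
    """Map raw file bytes of any format to a fixed-length float vector."""
    return length_hash_histogram(extract_strings(data))
```

Because the vector length depends only on the bin counts and not on the file format, the same function can be applied to, for example, an Office document and a PDF, and both outputs can be passed to a single classifier.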
Patent Application Publication, Aug. 1, 2019, Sheets 1 to 13 of 13, US 2019/0236273 A1

[Sheet 1, FIG. 1A: block diagram 100A. Legible labels: Data Source(s); Processor 110.]

[Sheet 2, FIG. 1B: flowchart 100B. Collect documents (130); File type? (131), with branches "Zip file" and "Office document"; Dump raw bytes from central directory (132); Extract features (e.g., string length-hash histogram(s), N-gram histogram(s), byte entropy histogram(s), byte mean-standard deviation histogram(s)) from collected documents (133); More files to analyze? (Yes/No); Convert documents into fixed-length floating point feature vectors (134); Concatenate all feature vectors (e.g., to generate a Receiver Operating Characteristics (ROC) curve) into a concatenated vector (136); Use concatenated vector to train classifier (e.g., DNN or XGB) (138).]

[Sheet 3, FIG. 2: archive layout. File 1 Header; File 2 Header; File N Header; Central Directory Structure; End of Directory.]

[Sheet 4, FIG. 3: "SM Dataset Breakdown by Document Type". Categories: MS Word Document; MS Excel Spreadsheet; MS PowerPoint Presentation; Office Open XML Document; Office Open XML Spreadsheet; Office Open XML Presentation.]

[Sheet 5, FIG. 4: legible labels: xgbvt; xgbcc; nncc.]

[Sheet 6, FIG. 5: "Common Crawl Breakdown by Document Type". Categories: MS Word Document; MS Excel Spreadsheet; MS PowerPoint Presentation; Office Open XML Document; Office Open XML Spreadsheet; Office Open XML Presentation.]

[Sheet 7, FIG. 6: plot with logarithmic axis ticks. Legible legend entry: xgbconcat_deep.]

[Sheet 8, FIG. 7A: no legible text.]

[Sheet 9, FIG. 7B: axis label "Hash Index"; reference numerals 702 and 704.]

[Sheet 10, FIG. 8: flowchart. Receive 1st Potentially Malicious File (802A); Extract 1st Set of Strings from 1st Potentially Malicious File (804A); Define 1st Feature Vector, Based on String Lengths of 1st Set of Strings (806A); Provide 1st Feature Vector to Machine Learning Model (808A); Receive 2nd Potentially Malicious File (802B); Extract 2nd Set of Strings from 2nd Potentially Malicious File (804B); Define 2nd Feature Vector, Based on String Lengths of 2nd Set of Strings (806B); Provide 2nd Feature Vector to Machine Learning Model (808B); Machine Learning Model (MLM) (810).]

[Sheet 11, FIG. 9: flowchart. Receive Potentially Malicious File with Archive Format (920); Identify Central Directory Structure of Potentially Malicious File (922); Extract Set of Strings from Central Directory Structure (924); Define Feature Vector Based on Set of Strings (926); Provide Feature Vector to Machine Learning Model for Maliciousness Classification (928).]

[Sheet 12, FIG. 10: flowchart. Train Machine Learning Model, Using Sets of Strings, to Produce Maliciousness Classifications for Multiple File Formats (1030); Define First Feature Vector Based on Length of a Set of Strings of a First Potentially Malicious File (1032); Identify Maliciousness Classification of First Potentially Malicious File (FF #1) Using Machine Learning Model (1034); Define Second Feature Vector Based on Length of a Set of Strings of a Second Potentially Malicious File (1036); Identify Maliciousness Classification of First Potentially Malicious File (FF #2) Using Machine Learning Model (1038).]

[Sheet 13, FIG. 11: "Archive Lengths by Type". Categories: RAR; ZIP; TAR; GZIP; JAR; Mozilla Firefox Extension; Google Chrome Extension.]

METHODS AND APPARATUS FOR DETECTION OF MALICIOUS DOCUMENTS USING MACHINE LEARNING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 62/622,440, filed Jan. 26, 2018 and titled "Methods and Apparatus for Detection of Malicious Documents Using Machine Learning," the content of which is incorporated herein by reference in its entirety.

BACKGROUND

[0002] Some known machine learning tools can be used to [...]

[...] produce a maliciousness classification for files having a first file format and files having a second file format different from the first file format. The first set of strings can be from a file having the first file format and the second set of strings can be from a file having the second file format. The method also includes defining a first feature vector based on a length of a set of strings within a first potentially malicious file having the first file format, and providing the first feature vector to the machine learning model to identify a maliciousness classification of the first potentially malicious file. The method also includes defining a second feature vector based on a length of a set of strings within a second potentially malicious file having the second file format, and providing the second feature vector to the machine learning model to identify a maliciousness classification of the second potentially malicious [...]
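For files with an archive format, FIG. 1B and FIG. 9 refer to dumping raw bytes from the central directory and extracting strings from that structure. The sketch below shows one plausible way to locate and dump the central directory of a standard ZIP file using the end-of-central-directory record. It is an assumption-laden illustration rather than the procedure of the publication: the function names `dump_central_directory` and `make_demo_archive` are hypothetical, ZIP64 archives and archives with prepended data are not handled, and the record offsets follow the general ZIP file format rather than anything stated in the text.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # end-of-central-directory record signature
EOCD_MIN_SIZE = 22              # size of that record without a trailing comment


def dump_central_directory(data: bytes) -> bytes:
    """Return the raw central directory bytes of a (non-ZIP64) ZIP archive."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_MIN_SIZE > len(data):
        raise ValueError("end-of-central-directory record not found")
    # Bytes 12..19 of the record hold the directory size and its start offset.
    cd_size, cd_offset = struct.unpack_from("<II", data, pos + 12)
    if cd_offset + cd_size > pos:
        raise ValueError("central directory bounds are inconsistent")
    return data[cd_offset:cd_offset + cd_size]


def make_demo_archive() -> bytes:
    """Build a small in-memory ZIP file (hypothetical demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/vbaProject.bin", "demo")
    return buf.getvalue()
```

The dumped bytes contain the member file names, so a string extractor run over them yields a set of strings that can then be turned into a feature vector, in the spirit of steps 922 to 926 of FIG. 9.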
