Shell Language Processing: Unix command parsing for Machine Learning

Dmitrijs Trizna, Department of Computer Science, University of Helsinki, [email protected]

Abstract

In this article, we present a Shell Language Processing (SLP) library, which implements tokenization and encoding directed at parsing Unix and Linux shell commands. We describe the rationale behind the need for a new approach, with specific examples where conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and achieve a significant improvement of the F1-score from 0.392 to 0.874.

1 Introduction

One of the most common interfaces that system engineers and administrators use to manage computers is the command line. Numerous interpreters called shells allow applying the same operational logic across various operating systems. The Bourne shell (abbreviated sh), Debian Almquist shell (dash), Z shell (zsh), and Bourne Again Shell (bash) are ubiquitous interfaces for operating Unix systems, known for speed, efficiency, and the ability to automate and integrate a diversity of tasks.

Contemporary enterprises rely on auditd telemetry from Unix servers, including execve syscall data describing executed shell commands. Generally, this data is analyzed using static, manually defined signatures for needs like intrusion detection. There is sparse evidence of the ability to use this information source efficiently with modern data analysis techniques, like machine learning or statistical inference. Leading ICT companies underline the relevance of the aforementioned problem and perform internal research to address this issue [5].

Complexity and malleability are key challenges in preprocessing shell commands. The structure has many intricate details, like aliases, different prefixes, and the order and values of the text, which make commands condensed and fast to input, but time-consuming to read and interpret. In this paper, we present the Shell Language Processing (SLP) library (https://github.com/dtrizna/slp), showing how it performs feature extraction from raw shell commands and how its output can successfully be used as input for different machine learning models. The shell command-specific syntax and challenges are discussed in Section 2; the inner workings and features of our pipeline are covered in Section 3. Sections 4 and 5 discuss a specific application case and provide a performance evaluation of different encoding techniques.

2 Shell specific challenges

The syntax of shell commands depends on the implementations of different Linux binaries and the way they handle input parameters. Standard techniques like the tokenize method from the nltk package would produce wrong tokens, since shell language deviates heavily from natural language. To start with, spaces do not always separate different parts of a command. Some commands like sed take a raw regular expression as a parameter, so a space (or any other special character) does not necessarily mark the start of the next token:

    sed 's/^chr//;s/\..* / /' filename

On the contrary, java takes flags without any spaces, knowing where each parameter name ends:

    java -Xms256m -Xmx2048m -jar remoting.jar

Such malleability in syntactic patterns poses a significant challenge for the successful parsing of shell command lines. We know of two libraries that attempt to address such specifics for bash - bashlex (https://github.com/idank/bashlex) and bashlint (https://github.com/skudriashev/bashlint) - however, neither of them does its job perfectly.
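To illustrate why generic NLP tokenizers struggle here, consider a minimal sketch (assuming only that nltk is installed) that applies two standard tokenizers to the sed example above; the regular expression argument is split into fragments that carry no stand-alone meaning:

    from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer

    cmd = "sed 's/^chr//;s/\\..* / /' filename"

    # WordPunctTokenizer splits on punctuation boundaries, shattering the
    # regular expression into fragments such as quotes, slashes, and dots.
    print(WordPunctTokenizer().tokenize(cmd))

    # WhitespaceTokenizer splits on spaces only, so the space inside the
    # regular expression incorrectly starts a new token.
    print(WhitespaceTokenizer().tokenize(cmd))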
We utilize bashlex for our needs as the primary parsing source, since it provides heuristics for tokenization of the most general patterns of shell commands. Still, there are multiple problems in the bashlex library's original syntax analysis, for example in the "command within a command" case (expressed in shell syntax as $(cmd) or `cmd`), as in:

    export IP=$(dig +short example.com)
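As a further illustration of how naive lexing mishandles command substitution, here is a short sketch using Python's standard-library shlex (not the bashlex-based approach used by SLP), where the embedded command is broken apart rather than recognized as a nested command:

    import shlex

    # shlex has no notion of $() command substitution, so the inner
    # command is split on whitespace into unrelated pieces.
    print(shlex.split("export IP=$(dig +short example.com)"))
    # e.g. ['export', 'IP=$(dig', '+short', 'example.com)']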

Such syntax is not handled by bashlex, where an embedded command is treated as a single element. Therefore, we implement additional syntactic logic wrapped around bashlex's object classes to handle this and other problematic cases.

3 Encoding

Subsequently, we looked at different ways of representing the data numerically. To produce arrays out of textual data we implement (a) label, (b) one-hot, and (c) term frequency-inverse document frequency (TF-IDF) encodings. Label encoding is built on top of scikit-learn [1], whereas the one-hot and TF-IDF encodings are implemented natively.

Having tokenization and encoding as operational functionalities, we were ready to provide a user-friendly interface to the underlying code. Therefore, we created dedicated Python classes that form the core of our library's interface. The tokenization interface is available via the ShellTokenizer() class. Besides bashlex, we utilize a Counter object from the collections package to store unique tokens and their appearance counts. The Counter allows working with the data conveniently for various visualizations. Encoding is then available via the ShellEncoder() class, which allows generating the different encodings and returns a scipy sparse array. Substantially, the whole preprocessing pipeline can be achieved within four lines of code:

    st = ShellTokenizer()
    corpus, counter = st.tokenize(shell_commands)
    se = ShellEncoder(corpus=corpus,
                      token_counter=counter,
                      top_tokens=100)
    X_enc = se.tfidf()

At this stage X_enc can be supplied to the fit() method of the API interface supported by machine learning (ML) models in libraries like scikit-learn [1] or even AutoML realizations like TPOT [3]. As a result, every administrator, security analyst, or even non-technical manager working with a Unix-like system may parse shell data for ML-based analytics using our library. Only basic Python coding skills and no knowledge of conventional Natural Language Processing (NLP) pipelines are needed.

4 Experimental setup

Assessment of tokenization and encoding quality is done on a security classification problem, where we train an ML model to distinguish malicious command samples from benign activity. Legitimate commands come from the nl2bash dataset [4], which represents shell commands collected from question-answering forums like stackoverflow.com and administratively focused cheat-sheets.

We collect the malicious samples ourselves, accumulating harmful examples across penetration testing and hacking resources that describe how to perform enumeration of Linux targets and acquire reverse shell connections from Unix hosts (for example, https://blog.g0tmi1k.com/2011/08/basic-linux-privilege-escalation/). All commands within the dataset are normalized: domain names are replaced by example.com and all IP addresses by 1.1.1.1, since in our evaluation we want to focus on command interpretability. We advise performing a similar normalization even in production applications, otherwise the ML model will overfit the training data.

For classification, we train a gradient boosting ensemble of decision trees, with the specific realization from the XGBoost library [2]. During the analysis of tokenization we do not implement validation, ergo we use the full dataset for training and compute metrics on the same data. Conversely, during the evaluation of encoding techniques, we implement a 10-fold cross-validation, so the resulting metrics are the mean across all validation passes.
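To make this setup concrete, below is a minimal sketch of the classification step described above, assuming X_enc comes from the SLP pipeline shown earlier and y is a binary label vector (1 for malicious, 0 for benign); the hyperparameters are illustrative, not those used in the paper:

    from xgboost import XGBClassifier
    from sklearn.model_selection import cross_val_score

    # Ensure the sparse matrix supports row indexing for cross-validation.
    X = X_enc.tocsr()

    # Gradient boosting ensemble of decision trees, as in the experimental setup.
    model = XGBClassifier(n_estimators=100)

    # 10-fold cross-validation; the reported metric is the mean across folds.
    scores = cross_val_score(model, X, y, cv=10, scoring="f1")
    print(scores.mean())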

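Similarly, the normalization step described earlier in this section can be sketched with two regular expression substitutions; this is only an illustration of the idea, not the exact rules used for the dataset:

    import re

    def normalize(cmd):
        # Replace IPv4 addresses with a placeholder address (coarse heuristic).
        cmd = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", "1.1.1.1", cmd)
        # Replace domain names with a placeholder domain (coarse heuristic).
        cmd = re.sub(r"\b(?:[\w-]+\.)+(?:com|net|org|io)\b", "example.com", cmd)
        return cmd

    print(normalize("nc 10.0.0.5 4444 -e /bin/sh"))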

5 Evaluation

The first experiments were conducted to assess the quality of our tokenization. We tokenized the aforementioned dataset using our ShellTokenizer, with the NLTK WordPunctTokenizer and WhitespaceTokenizer as alternatives, which are known to be used in the ICT industry for log parsing. All three corpora were then encoded using the same term frequency-inverse document frequency (TF-IDF) realization.

    Tokenizer     AUC     F1      Precision   Recall
    SLP (ours)    0.994   0.874   0.980       0.789
    WordPunct     0.988   0.392   1.0         0.244
    WhiteSpace    0.942   0.164   1.0         0.089

Table 1: Comparison of tokenization techniques on the security classification task: SLP, WordPunctTokenizer, WhitespaceTokenizer.

Security classification results are available in Table 1. We see that conventional tokenization techniques fail to distinguish crucial syntactic patterns of commands; therefore both are biased towards the majority class of benign samples (with high precision and low recall). This situation is unacceptable for tasks like security classification - a model that fails to detect malicious samples at all (false negatives) cannot be deployed in a production environment. On the contrary, our tokenizer achieves significantly higher recall values and, consequently, F1-score (0.874 versus 0.392 from the closest alternative).

Afterward, we focus on the evaluation of different encoding techniques. We continue the tests with our tokenization implementation and analyze all three supported encodings - TF-IDF, one-hot, and label encoding. We display the receiver operating characteristic (ROC) curve and Area-Under-the-Curve (AUC) values in Figure 1.

Figure 1: Receiver operating characteristic (ROC) curve with the error's standard deviation across ten cross-validation folds for all three encoding techniques.

It is seen that none of the encoding techniques has a clear preference over the others. Therefore we implemented all three encoding types, since even basic logic like label encoding can yield the best results on specific data types and problem statements. We encourage analysts to experiment with various shell preprocessing techniques to understand which approach benefits their pipelines.
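A sketch of how such an encoding comparison can be run is shown below. Only se.tfidf() is documented in the pipeline example above; the one-hot and label encoding calls are hypothetical method names used for illustration, and the library's actual interface may differ:

    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.model_selection import cross_val_score

    encodings = {
        "tfidf": se.tfidf(),
        "onehot": se.onehot(),   # assumed method name
        "labels": se.labels(),   # assumed method name
    }

    # Compare mean and standard deviation of AUC across 10 cross-validation folds.
    for name, X in encodings.items():
        model = XGBClassifier(n_estimators=100)
        auc = cross_val_score(model, X.tocsr(), y, cv=10, scoring="roc_auc")
        print(name, np.mean(auc), np.std(auc))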

6 Conclusions

In this article, we presented custom tokenization and encoding techniques focused on shell commands. We described the rationale for the dedicated tokenization approach, with specific shell command examples where conventional NLP techniques fail, and briefly covered the inner workings of our library. To distinguish this technique from known existing pipelines, we evaluated it on a security classification task, with a custom dataset collected from real-world samples across the web and an efficient ensemble model. According to the acquired metrics, our model achieves a significant improvement of the F1-score.

References

[1] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.

[2] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug 2016. doi: 10.1145/2939672.2939785. URL http://dx.doi.org/10.1145/2939672.2939785.

[3] T. T. Le, W. Fu, and J. H. Moore. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, 36(1):250–256, 2020.

[4] X. V. Lin, C. Wang, L. Zettlemoyer, and M. D. Ernst. NL2Bash: A corpus and semantic parser for natural language interface to the Linux operating system, 2018.

[5] A. Patel, Z. Kunik, Z. Kocur, J. Bedziechowska, B. Nawrotek, P. Piasecki, and M. Kowiel. Detecting adversarial behaviour by applying NLP techniques to command lines. https://blog.f-secure.com/command-lines/, 2020.