Anomaly Detection in SMTP Conversations
Total Page:16
File Type:pdf, Size:1020Kb
Piotr Bazydło Accepted version of paper published in IEEE Security & Krzysztof Lasota Privacy, vol. 15, issue 6, Nov/Dec 2017, pp. 25-32 (in Adam Kozakiewicz accordance with IEEE policy on electronic information dissemination) Research and Academic Computer Network NASK DOI: 10.1109/MSP.2017.4251116 Kolska 12, 01-045 Warsaw, Poland [email protected] © 2017 IEEE. Personal use of this material is permitted. [email protected] Permission from IEEE must be obtained for all other uses, in any [email protected] current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Botnet fingerprinting method based on anomaly detection in SMTP conversations Abstract: The paper presents the results obtained during research on detection of unsolicited e-mails which are sent by botnets. The distinction from most of the existing solutions is the fact that the presented approach is based on the analysis of network traffic – the sequence and syntax of SMTP commands observed during email delivery process. The paper presents several improvements for detection of unsolicited email sources from different botnets (fingerprinting), which can be used during network forensic investigation. 1. Introduction The system of electronic mail is essential for business communication and most of the organizations cannot imagine working or even existing without e-mail accounts. On the other hand, it is also used for malicious activities. Just sending the advertising emails, besides Fig. 1. Spambot email delivery with brokers (A) and being inconvenient and wasting various types of direct (B) resources, e.g. available bandwidth (more than half of which unwanted messages reach the target e-mail the email messages are classified as spam [1]) or disk servers. In the second case (Fig. 1B), where bots are able space, carries a small risk of a security violation. Spam is to use specific conversation with mail servers, the commonly used during phishing campaigns and can number of spam sources dramatically increases, up to the include links to malicious websites or malicious number of compromised machines controlled by the attachments. botmaster. Mail servers receive messages from a huge Classical approach to e-mail forensic investigation number of sources. Filtration is no longer trivial, source [2] assumes analysis of electronic message headers and e-mail address, domain or IP address of the sender are contents to identify the sender. The direct source of spam not enough. usually turns out to be misconfigured/compromised mail A large number of scientific papers describing servers or specially prepared brokers (proxies), known as methods of spam detection have been published [5, 6]. open relays which allow uncontrolled sending of These are mainly based on the analysis of the received electronic messages. Such sources of threats can be message - its headers, content and attachments. For this trivially filtered using blacklists. purpose statistical methods [7], graph theory [8] and This in turn resulted in the development of even machine learning algorithms [9,] are used. advanced spambots - malicious programs specialized in Noteworthy are the papers of Stringhini et al. ([10] and sending spam using their own implementation or its follow-up [11]), which present quite a different existing libraries containing SMTP (Simple Mail approach. Authors focus on the analysis of network Transport Protocol - standard protocol used to send e- communication observed during the sending of a mail) client functionality. However, the modular message by a spambot, while creating a profile of so- structure of modern bots allows the same infected called SMTP dialects and comparing them with normal machines to perform other malicious activities (DDoS, traffic generated by benign mail sending applications. bitcoin mining etc.) [3, 4]. Their results show mistakes in the implementation of the Figure 1 illustrates the two main modes of spambot SMTP protocol made by the malware authors and operation. In the first mode (Fig. 1A), an infected therefore possibility of its classification. Detection of computer sends spam via the aforementioned brokers. anomalies in SMTP traffic may find application in the Thereby, it is restricting the number of locations from network forensic investigation on malicious activity of spambots [12]. Paper [13] presents in-depth analysis of S: 220 smtp.server.com spam detection through network characteristics, like IP C: EHLO my.example.com TTL derived from TCP SYN packet or completion time S: 250 smtp.server.com of TCP three way handshake. C: MAIL FROM:<[email protected]> In this paper, we focus on the development of the S: 250 2.1.0 Ok algorithm of detection and classification of messages C: RCPT TO:<[email protected]> sent by spambots. The article presents additional features S: 250 2.1.5 Ok (section 3) and results (both spam detection and botnet C: DATA fingerprinting) that we have obtained during our work on S: 354 the analysis of unsolicited emails (section 4). The main C: Test message. part of the article is preceded by a short description of C: . the structure of an SMTP conversation and a definition S: 250 2.0.0 Ok of an SMTP dialect (section 2). The paper finishes with a C: QUIT brief conclusion of our findings (section 5). S: 221 2.0.0 Bye When TCP connection is successfully established, 2. SMTP Dialects the server greets the client (with “220” command). In order to understand the idea of SMTP dialects The client responds using the “EHLO” command analysis, some basics have to be established. (or “HELO” in a deprecated version of SMTP). Then, SMTP is a global standard for email transfer. This the sender of the email is provided using “MAIL text-based protocol was defined in 1982 and updated in FROM” command. If the sender address is accepted by 2008 by RFC 5321. In simple words, it can be described the server, the recipient address is provided using as the rules of conversation between MUA (Mail User “RCPT TO”. Finally, the client signalizes beginning of Agent, e.g. Microsoft Outlook, Mozilla Thunderbird) the message through the “DATA” command and ends it clients and MTA/MDA servers (Mail Transfer Agent, using a special terminator (dot). The transmitted message Mail Delivery Agent, e.g. Postfix or Microsoft Exchange should conform to the IMF (Internet Message Format) Server), which should result in electronic mail delivery. standard. Conversation ends with the “QUIT” command Generalized mail delivery is presented in fig. 2 and and the server response. the procedure is performed in a following way: In this simple scenario, only five from eleven • 1) user provides an input (message and primary SMTP client commands are used. Every SMTP recipients) to the MUA which transfers mail to command must end with <CR><LF> (carriage return + the mail server using the SMTP protocol, line feed) line terminator. • 2a) if server is used as a relay, the MTA Extended SMTP allows the client to use available forwards the message to the appropriate mail SMTP extensions, listed by the server after successful server using SMTP, “EHLO”. For example, the client may receive • 2b) if the current MTA is the final destination information whether encrypted connections initialized by of the message, mail delivery to the local “STARTTLS” command are supported. mailbox is handled by the server's MDA (Mail Like in a spoken language, there are rules that have Delivery Agent), to be obeyed in order for an SMTP conversation to be understood by both sides, but some minor grammatical • 3) mail is downloaded from the proper mailbox differences or errors have no real impact on the by the MUA using POP (Post Office Protocol) conversation outcome. or IMAP (Internet Message Access Protocol) If for example, the MTA recognizes both “MAIL protocols. FROM:<[email protected]>” and “mail from: [email protected]" commands properly and can distinguish the address correctly. SMTP commands are case-insensitive, and some characters like “<>” or some whitespaces are optional. Various equivalent forms of SMTP commands lead to the possibility of distinguishing different SMTP implementations and their sources. As mentioned in the introduction section, first work towards analysis of different SMTP protocol implementations was presented by G. Stringhini et. al. [10]. Authors provided the solution called “B@bel”, in which SMTP dialects derived from incoming TCP sessions are analyzed and compared with the known Fig. 2. Direct and indirect email delivery. (benign) SMTP dialects. In later research [12], G. Stringhini et. al. used primary IMF fields like message SMTP conversation is carried out according to the subject or sender address to connect SMTP dialects with client-server model and looks the same, whether spam campaigns. message delivery is performed between the MUA and the MTA or two different MTA's. The following listing shows a typical SMTP conversation, where an email is sent to one recipient. 3. SMTP Dialect Extension suspicious because of the inconsistency in dialect. The As our solution is based on the concept presented in same reasoning can be applied to MTA dialects. [10, 11], we propose the following improvements to the Another information used in our research included already developed SMTP dialects fingerprinting: the IMF “Received” trace field. It is supposed to inform • analysis of full SMTP conversation (“B@bel" the current MTA about the real route of an email. If the system was limited to the investigation until email is received by an MTA for delivery to another “DATA” command), server, it has to insert a trace in the form of a “Received” • analysis of SMTP extensions syntax, field. In case of direct email delivery, there should not be • analysis of IMF fields strictly linked with any “Received” trace in IMF.