A Formal Approach Towards Assessing the Effectiveness of Anti-Spam Procedures
Guido Schryen
Institute of Business Information Systems, RWTH Aachen University

Abstract

Spam e-mails have become a serious technological and economic problem. So far we have been reasonably able to resist spam e-mails and use the Internet for regular communication by deploying complementary anti-spam approaches. However, if we are to avert the danger of losing the Internet e-mail service as a valuable, free, and worldwide medium of open communication, anti-spam activities should be performed more systematically than is done in current, mainly heuristic, anti-spam approaches. A formal framework within which the modes of spam delivery, anti-spam approaches, and their effectiveness can be investigated may encourage a shift in methodology and pave the way for new, holistic anti-spam approaches.

This paper presents a model of the Internet e-mail infrastructure as a directed graph and a deterministic finite automaton, and draws on automata theory to formally derive the possible modes of spam delivery. Finally, the effectiveness of anti-spam approaches in terms of coverage of spamming modes is assessed.

1. Introduction

Spam e-mails have become a serious technological and economic problem. So far we have been reasonably able to resist spam e-mails, although statistics show a proportion of spam higher than 60% [11, 12]. The availability of the Internet e-mail system for regular e-mail communication is currently ensured by complementary anti-spam approaches, mainly by blocking and filtering procedures. However, if we are to avert the danger of losing the Internet e-mail service as a valuable, free, and worldwide medium of open communication, anti-spam activities should be performed more systematically than is done in current, mainly heuristic, anti-spam approaches. A formal framework within which the modes of spam delivery, anti-spam approaches, and their effectiveness can be investigated may encourage a shift in methodology and pave the way for new, holistic anti-spam approaches.

In Section 2 the Internet e-mail infrastructure is modeled as a directed graph and a deterministic finite automaton, and the appropriateness of the model is proved. In Section 3 all possible modes of sending (spam) e-mails are derived and presented as regular expressions. These (technically oriented) expressions are then grouped into categories according to the types of organization participating in e-mail delivery. The effectiveness of today's most important anti-spam approaches is assessed in Section 4 by matching them with the modes of spamming. Section 5 summarizes the results presented in this manuscript and outlines future work.

2. Model of the Internet e-mail infrastructure

The Internet e-mail infrastructure can be modeled as a directed graph G, which is defined in the first subsection. In the second subsection it is shown that all types of e-mail deliveries can be described with a set of (directed) paths in G. Since every option for sending e-mails is at the same time an option for sending spam e-mails, the set of spamming options and the set of mailing options are regarded as equal.

2.1 Definition

Since the Internet e-mail network infrastructure which the graph is intended to represent is dynamic, it is not meaningful to model each concrete e-mail node. The different types of Internet e-mail nodes, on the other hand, are static, and it is these which can serve the purpose. An e-mail node is here defined as a software unit which is involved in the Internet e-mail delivery process and works on the TCP/IP application layer. Consideration of software working exclusively on lower levels, such as routers and bridges, is beyond the scope of this work, as are options to send an e-mail without any SMTP communication with an e-mail node of the recipient's organization. However, this does not seem to be an important restriction, as almost all e-mail users receive their e-mails from a server that is (directly or indirectly) SMTP connected to the Internet.

Let $G = \{V, E, c\}$ be a directed graph with vertex set $V$ and edge set $E$, and let $c: E \to L$ be a total function on $E$, where $L$ denotes a set of (protocol) labels. First the structure of the graph is presented graphically (see figure 1) and formally. Its semantics is then explained in more detail.

The set of vertices can be depicted as the disjoint union of five vertex sets $V_1, \dots, V_5$. Each of these sets is attached to one of the organizational units participating in e-mail delivery: sender, sending organization or e-mail (service) provider, Internet, receiving organization, and recipient. In cases where the recipient does not use a receiving organization as e-mail service provider (ESP) but runs his own e-mail receiving and processing environment, the organizational units receiving organization and recipient merge. This, however, does not affect the structure of the graph, which retains its general validity. Let the set of vertices be $V = V_1 \cup \dots \cup V_5$ with

$V_1 = \{MTA_{send}, MUA_{send}, OtherAgent_{send}\}$, the set of vertices attached to the sender,

$V_2 = \{MTA_{sendOrg}, MTA^{inc}_{sendOrg}, WebServ_{sendOrg}\}$, the set of vertices attached to the sending organization,

$V_3 = \{\text{SMTP-Relay}, GW_{SMTP,B}, GW_{A,SMTP}, GW_{A,B}\}$, the set of vertices attached to the Internet,

$V_4 = \{MTA^{inc}_{recOrg}, MTA_{recOrg}, MDA_{recOrg}, MailServ_{recOrg}, WebServ_{recOrg}\}$, the set of vertices attached to the receiving organization, and

$V_5 = \{MUA_{recOrg}\}$, the set of vertices attached to the recipient.

Let the set of (protocol) labels be $L = \{SMTP, SMTP^*, \overline{SMTP^*}, HTTP(S), INT, MAP\}$. $E$ and $c$ are not defined formally here, but shown graphically in figure 1.

Each vertex corresponds to a type of e-mail node. An edge $e = (v_1, v_2)$ with a label $c(e) \in L$ attached exists if and only if the Internet e-mail infrastructure allows e-mail flow between the corresponding node types; $c(e)$ denotes the set of feasible protocols.

The set SMTP contains SMTP (as a protocol) extended by all IANA-registered SMTP service extensions, also referred to as ESMTP, such as SMTP Service Extension for Authentication (RFC 2554), Deliver By SMTP Service Extension (RFC 2852), SMTP Service Extension for Returning Enhanced Error Codes (RFC 2034), and SMTP Service Extension for Secure SMTP over Transport Layer Security (RFC 3207); see www.iana.org/assignments/mail-parameters for a list of SMTP service extensions.

The set $SMTP^*$ contains the set SMTP and all SMTP extensions specified for e-mail submission from a Mail User Agent (MUA) to an e-mail node with an SMTP incoming interface. This e-mail node can be a Mail Transfer Agent (MTA) as specified in RFC 2821 or a Message Submission Agent (MSA) as specified in RFC 2476. With reference to the latter, $MTA^{inc}_{sendOrg}$ can alternatively be denoted as $MSA_{sendOrg}$, and $MTA^{inc}_{recOrg}$ as $MSA_{recOrg}$. Port 587 is reserved for e-mail message submission. However, while most e-mail clients and servers can be configured to use port 587 instead of 25, there are cases where this is not possible or convenient. A site may use port 25 even for message submission. With an MSA, numerous methods can be applied to ensure that only authorized users can submit messages. These methods include authenticated SMTP, IP address restrictions, secure IP, and prior Post Office Protocol (POP) authentication, where clients are required to authenticate their identity before an SMTP submission session ("SMTP after POP").

$\overline{SMTP^*}$ is the union of three sets of protocols. The first contains all Internet application protocols except those in $SMTP^*$. The second contains all proprietary application protocols used on the Internet; this inclusion takes tunneling procedures into account. The third, since use of application protocols is not mandatory for the exchange of data in a network, contains all Internet protocols on the transport and network layers of the Department of Defense (DoD) model, such as TCP and IP.

MAP is the set of all e-mail access protocols used to transfer e-mails from the recipient's e-mail server to his MUA. Internet Message Access Protocol (IMAP) version 4 (RFC 1730) and POP version 3 (RFC 1939) are among the most widely deployed of these protocols. The set HTTP(S) contains the protocol HTTP (RFC 2616) as well as its secure versions "HTTP over SSL" [3] and "HTTP over TLS" (RFC 2818). Finally, the set INT denotes protocols and procedures for internal e-mail delivery, i.e. inside the receiving organization, such as getting e-mails from an internal MTA and storing them in the users' e-mail boxes.

[Figure 1: Internet e-mail infrastructure as directed graph. The vertices of $V_1, \dots, V_5$ are arranged in five columns (sender, sending organization, Internet, receiving organization, recipient); the edges e1 to e38 carry protocol labels from $L$.]

2.2 Appropriateness

The meaning of G in the context of e-mail delivery is given by the fact that the different types of e-mail

$v_{start} \in V_{start} = V_1 \cup \{MTA_{sendOrg}, MTA^{inc}_{sendOrg}\}$ and $v_{end} \in V_{end} = \{MailServ_{recOrg}, MTA_{recOrg}, MTA^{inc}_{recOrg}\}$, and $P$ the set of all these paths.
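The labeled graph of Section 2.1 and the idea of a path set running from a start vertex to an end vertex can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the paper's model. The vertex sets, the label set, and the start and end sets follow the text, but the edge list is a hypothetical subset chosen for illustration (the edges e1 to e38 of figure 1 are not reproduced), the name `NOT_SMTP*` merely stands in for the third SMTP label, and the restriction to simple paths is an assumption made to keep the enumeration finite.

```python
from collections import defaultdict

# Vertex sets V1..V5 as listed in Section 2.1 ("_inc" marks an incoming SMTP interface).
V1 = {"MTA_send", "MUA_send", "OtherAgent_send"}
V2 = {"MTA_sendOrg", "MTA_inc_sendOrg", "WebServ_sendOrg"}
V3 = {"SMTP-Relay", "GW_SMTP_B", "GW_A_SMTP", "GW_A_B"}
V4 = {"MTA_inc_recOrg", "MTA_recOrg", "MDA_recOrg",
      "MailServ_recOrg", "WebServ_recOrg"}
V5 = {"MUA_recOrg"}
V = V1 | V2 | V3 | V4 | V5

# Label set L; "NOT_SMTP*" is a placeholder name for the third SMTP label.
LABELS = {"SMTP", "SMTP*", "NOT_SMTP*", "HTTP(S)", "INT", "MAP"}

# ASSUMPTION: a hypothetical subset of edges with plausible labels.
# The real edge set E and labeling c are given only in figure 1.
EDGES = {
    ("MUA_send", "MTA_inc_sendOrg"): "SMTP*",
    ("MTA_inc_sendOrg", "MTA_sendOrg"): "SMTP",
    ("MTA_sendOrg", "MTA_inc_recOrg"): "SMTP",
    ("MTA_send", "MTA_inc_recOrg"): "SMTP",
    ("MTA_sendOrg", "SMTP-Relay"): "SMTP",
    ("SMTP-Relay", "MTA_inc_recOrg"): "SMTP",
    ("MTA_inc_recOrg", "MailServ_recOrg"): "INT",
    ("MailServ_recOrg", "MUA_recOrg"): "MAP",
}

# Sanity checks: c is a total function from E into L, and E is a subset of V x V.
assert all(u in V and v in V for (u, v) in EDGES)
assert set(EDGES.values()) <= LABELS

# Start and end vertex sets as given in Section 2.2.
V_START = V1 | {"MTA_sendOrg", "MTA_inc_sendOrg"}
V_END = {"MailServ_recOrg", "MTA_recOrg", "MTA_inc_recOrg"}


def simple_paths(edges, starts, ends):
    """Enumerate all simple paths (no vertex visited twice) that begin in
    `starts` and finish in `ends`, using depth-first search."""
    adjacency = defaultdict(list)
    for (u, v) in edges:
        adjacency[u].append(v)

    found = []

    def walk(path):
        node = path[-1]
        if node in ends and len(path) > 1:
            found.append(tuple(path))
        for successor in adjacency[node]:
            if successor not in path:  # keep the path simple
                walk(path + [successor])

    for start in starts:
        walk([start])
    return found


P = simple_paths(EDGES, V_START, V_END)
for path in sorted(P):
    print(" -> ".join(path))
```

Once the edge set contains a cycle, for example chained relays, the set of unrestricted paths becomes infinite and can no longer be listed in this way; a finite description such as the regular expressions announced for Section 3 is then needed.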