
Implementation of Proactive Spam Fighting Teniques

Masterarbeit

von Martin Gräßlin Rupret-Karls-Universität Heidelberg

Betreuer: Prof. Dr. Gerhard Reinelt Prof. Dr. Felix Freiling

03. März 2010

Ehrenwörtlie Erklärung

I versiere, dass i diese Masterarbeit selbstständig verfasst, nur die angegebenen ellen und Hilfsmiel verwendet und die Grundsätze und Empfehlungen „Verantwortung in der Wissensa“ der Universität Heidelberg beatet habe.

Ort, Datum Martin Gräßlin

Abstract

One of the biggest allenges in global communication is to overcome the problem of unwanted , commonly referred to as spam. In the last years many approaes to reduce the number of spam emails have been proposed. Most of them have in common that the end-user is still required to verify the filtering results. ese approaes are reactive: before mails can be classified as spam in a reliable way, a set of similar mails have to be received. Spam fighting has to become proactive. Unwanted mails have to be bloed before they are delivered to the end-user’s . In this thesis the implementation of two proactive spam fighting teniques is discussed. e first concept, called -Shake, introduces an authentication before a sender is allowed to send emails to a new contact. Computers are unable to authenticate themselves and so all spam messages are automatically bloed. e development of this concept is discussed in this thesis. e second concept, called Spam Templates, is motivated by the fact that spam messages are generated from a common template. If we gain access to the template we are able to identify spam messages by mating the message against the template. As the template is generated from currently sent spam messages, the template will never mat a legitimate mail. In this thesis mating a mail against a template is implemented. In the scope of this thesis an evaluation for the Mail-Shake concept is provided. is evaluation shows that Mail-Shake is able to reduce the number of received spam messages and mails containing malicious soware.


Acknowledgement

First of all I want to thank Professor Gerhard Reinelt and Professor Felix Freiling for making it possible for me to write this thesis at the Laboratory for Dependable Distributed Systems at the University of Mannheim. I also want to thank my supervisors Jan Göbel and Philipp Trinius. Their suggestions and feedback are very much appreciated and helped to develop the system presented in this thesis.

A special thanks to all my friends and my family for testing the system and providing valuable feedback on its usability. I especially want to thank Arthur Arlt who was always willing to discuss details about the implementation and this document.

I want to thank the KDE community and Qt Development Frameworks for providing such a great and coherent development framework. The KDE community has helped me improve my C++ coding skills during the last years. This was useful during the implementation as many problems were already known and could be solved easily.

In general I want to thank the complete Free and Open Source community. Without their ideas of free software it would not have been possible to realize such a project. The complete project including this document has been implemented and written with the help of Free or Open Source software.

Last but not least I want to thank my parents for their financial support during my Master studies so that I could concentrate on my classes.


Contents

1 Introduction
  1.1 Motivation
  1.2 Proactive Spam Fighting Techniques
  1.3 Notes About the Implementation
  1.4 Structure of This Thesis

2 Proactive Spam Fighting
  2.1 Related Work
    2.1.1 Bayesian Filtering
    2.1.2 DNS Blacklists
    2.1.3 URI Blacklist
    2.1.4 Greylisting
    2.1.5 Conclusion
  2.2 The Mail-Shake Concept
    2.2.1 Proactive Spam Fighting With Dynamic Whitelists
    2.2.2 Limitations of the Mail-Shake Concept
    2.2.3 Summary
  2.3 The Spam Templates Concept
    2.3.1 Template Based Spam Mails
    2.3.2 Generation of Templates
    2.3.3 Proactive Filtering
    2.3.4 Summary

3 Background
  3.1 Evaluation of Current CAPTCHA Techniques
    3.1.1 Introduction
    3.1.2 Simple Obfuscation
    3.1.3 Image Based CAPTCHAs
    3.1.4 Audio Based CAPTCHAs
    3.1.5 Image Recognition CAPTCHAs
    3.1.6 Riddle
    3.1.7 reCAPTCHA
    3.1.8 Conclusion
  3.2 Excursus: Breaking a CAPTCHA System
    3.2.1 The Scr.im CAPTCHA System
    3.2.2 Flaws in the Design of the Scr.im CAPTCHA System
    3.2.3 Attack on the CAPTCHA System
    3.2.4 Lessons Learned
  3.3 Akonadi
    3.3.1 Plugins Compared to Central Storage
    3.3.2 Akonadi as the Central Storage Solution
    3.3.3 Design of Akonadi
    3.3.4 Summary

4 Development of the Systems
  4.1 Software Requirements for Mail-Shake
    4.1.1 Answering Spam Messages
    4.1.2 Delivery Status Notifications
    4.1.3 Public Mail Address
    4.1.4 Sending Mails
    4.1.5 Private Mail Address
    4.1.6 Summary
  4.2 Design of Mail-Shake
    4.2.1 Client Independent Library
    4.2.2 Akonadi Agent
    4.2.3 Client Integration
    4.2.4 Summary
  4.3 Implementation of Mail-Shake
    4.3.1 Mail-Shake Library
    4.3.2 Mail-Shake Akonadi Agent
    4.3.3 Mail-Shake Integration in Clients
  4.4 Implementation of Spam Templates
    4.4.1 Generating the RSS Feed
    4.4.2 Testing a Mail
    4.4.3 Summary

5 Evaluation
  5.1 Mail-Shake Evaluation Setup
  5.2 Results of Mail-Shake Evaluation
  5.3 Greylisting
  5.4 Results from January
  5.5 Results from February
  5.6 Summary

6 Retrospection and Future Tasks
  6.1 Problems caused by Akonadi
  6.2 Future tasks for Spam Templates
  6.3 Future Tasks for Mail-Shake
    6.3.1 Handling of Delivery Status Notifications
    6.3.2 Mail-Shake for Several Addresses
    6.3.3 Solving Mail-Shake Challenges in Email Clients
    6.3.4 Integrating Mail-Shake Directly Into Email Clients
  6.4 CAPTCHA Security

7 Conclusion

A Examples of Delivery Status Notifications
  A.1 RFC Compliant
  A.2 Exim
  A.3 QMail
    A.3.1 MIME Mail
    A.3.2 Plain Text Mail
  A.4 Google Mail

B Mails from Automated Systems
  B.1 Review Board
  B.2 Bugzilla

C Mail-Shake API Documentation
  C.1 MailShake Namespace Reference
    C.1.1 Detailed Description
    C.1.2 Typedef Documentation
    C.1.3 Enumeration Type Documentation
  C.2 MailShake::DSN Class Reference
    C.2.1 Detailed Description
    C.2.2 Member Function Documentation
  C.3 MailShake::DSNPrivate Class Reference
  C.4 MailShake::EMail Class Reference
    C.4.1 Detailed Description
    C.4.2 Member Function Documentation
  C.5 MailShake::EMailPrivate Class Reference
  C.6 MailShake::Id Class Reference
    C.6.1 Detailed Description
    C.6.2 Member Function Documentation
  C.7 MailShake::IdPrivate Class Reference
  C.8 MailShake::MailShake Class Reference
    C.8.1 Detailed Description
    C.8.2 Member Function Documentation
  C.9 MailShake::MailShakePrivate Class Reference
    C.9.1 Member Function Documentation
    C.9.2 Member Data Documentation
  C.10 MailShake::WhiteListEntry Class Reference
    C.10.1 Detailed Description
    C.10.2 Member Function Documentation
  C.11 MailShake::WhiteListEntryPrivate Class Reference

D Mailman Archive Address Harvester
  D.1 main.cpp
  D.2 mailmanharvester.h
  D.3 mailmanharvester.cpp
  D.4 mailmanharvesterview.h
  D.5 mailmanharvesterview.cpp
  D.6 mailmanharvesterviewbase.ui

E Automated Scr.im CAPTCHA Solver
  E.1 main.cpp
  E.2 ScrimCracker.h
  E.3 ScrimCracker.cpp
  E.4 CMakeLists.txt

F Dialog to Solve a Mail-Shake Challenge
  F.1 mailshakedialog.h
  F.2 mailshakedialog.cpp

G RSS Generator
  G.1 main.cpp
  G.2 rssgenerator.h
  G.3 rssgenerator.cpp
  G.4 CMakeLists.txt

H Spam Templates Library
  H.1 template.h
  H.2 template.cpp
  H.3 templatemanager.h
  H.4 templatemanager.cpp
  H.5 mail.h
  H.6 mail.cpp

List of Figures

2.1 Overview of the Mail-Shake email process
2.2 Example of a Mail-Shake challenge mail
2.3 Leakage of private Mail-Shake address
2.4 Mail-Shake authentication initiated on private address
2.5 Mail loop triggered by a spam mail with an invalid sender address
2.6 Web Service as relay of a mail
2.7 Template based spamming
2.8 Example of a generated Spam template

3.1 Example of a reCAPTCHA
3.2 CAPTCHA containing email address “[email protected]”
3.3 Example of an Asirra CAPTCHA
3.4 The words to be used for reCAPTCHA
3.5 The scr.im CAPTCHA system
3.6 Different images for the same scr.im CAPTCHA
3.7 Comparison of original CAPTCHA image and the result of the pixel shader
3.8 Two different applications to handle public and private addresses
3.9 Basic aspects of the Akonadi architecture
3.10 Components of Akonadi

4.1 Abusing Mail-Shake to send spam
4.2 Activity diagram for processing mails sent to the public address
4.3 Activity diagram for sending mails
4.4 Activity diagram for receiving mails on private address
4.5 Classes EMail and DSN of the Mail-Shake library
4.6 Class WhiteListEntry of the Mail-Shake library
4.7 High Level Class Diagram of the Mail-Shake library
4.8 High Level Class Diagram of Mail-Shake’s client side implementation
4.9 Communication between Akonadi server, agent and Mail-Shake library
4.10 Class diagram for Mail-Shake integration
4.11 Classes EMail and DSN split in interface and implementation classes
4.12 Template of a Mail-Shake challenge
4.13 Dialogs to configure the whitelist
4.14 Notification upon receipt of not whitelisted mail
4.15 Message View with Mail-Shake challenge mail integration
4.16 Dialogs to solve the Mail-Shake challenge
4.17 Configuration for determining the matching score

5.1 Rejected mails in December 2009 on the evaluated MTA
5.2 Rejected, bounced and junk mails in January 2010 on the evaluated MTA

6.1 Mail-Shake Agents in the systray

D.1 Application for extracting addresses from Mailman archives

List of Tables

4.1 Examples for subjects containing a Mail-Shake id
4.2 Size of Mail-Shake measured in Source Lines of Code
4.3 Database structure of Mail-Shake agent
4.4 Mail headers used in Mail-Shake challenge and notification mails
4.5 Changed files for Mail-Shake challenge integration in Mailody
4.6 Command line options for the RSS generation tool

5.1 Private and public addresses used during the Mail-Shake evaluation
5.2 Number of mails filtered by Mail-Shake in January 2010 for the different addresses
5.3 Statistics for filtered mail per address in January
5.4 Number of mails filtered by Mail-Shake in February 2010 for the different addresses
5.5 Statistics for filtered mail per address in February


List of Listings

3.1 Pixel shader for extracting characters from the scr.im CAPTCHAs
4.1 Matching a string against a whitelist entry
4.2 Trivial algorithm to check if a mail is whitelisted
4.3 Comparing the whitelist entries to a given datum
4.4 Improved algorithm to test if a mail is whitelisted based on a smarter data structure
4.5 Handling the receipt of a mail sent to the public address
4.6 Generating a new unique identifier
4.7 Extracting the Mail-Shake Id from a mail subject
4.8 Checking if received private mail is whitelisted or a DSN
4.9 Checking if mail contains a challenge response Id or is on temporary whitelist
4.10 Move an entry from temporary to permanent whitelist or create a new one
4.11 Adding a whitelist entry for each recipient of a sent mail
4.12 Connecting a slot to the signal with the boost library
4.13 Slot for removing one Id from the storage
4.14 Connecting Signals and Slots with Qt
4.15 Fetching a mail sent to the public address
4.16 Extracting headers from a KMime message
4.17 Extracting the Mail-Shake headers in Mailody
4.18 Displaying Mail-Shake challenge information in Mailody’s header widget
4.19 Intercepting a click on a link in order to open the Mail-Shake challenge dialog
4.20 Extracting CAPTCHA from the reCAPTCHA web page
4.21 Testing if the web page contains the revealed mail address
4.22 Generating an RSS item from one template file
4.23 Generated RSS feed containing one template
4.24 Algorithm for matching a mail body against a template
4.25 Forward and backward search for a matching line based on fuzziness


1 Introduction

1.1 Motivation

Unsolicited bulk emails, or in general spam or junk mails, have become one of the greatest challenges of current global communication. About 80 percent of the world’s email communication is not legitimate[10]. This includes not only spam mails but also malicious software and phishing mails. These mails cause a global economic loss of EUR 36 billion each year plus EUR ten billion lost due to fraud[3]. 33 billion kWh are required to process the 62 trillion spam mails each year and 104 billion user hours are required to check and delete these junk mails[42].

Unfortunately sending spam messages is a profitable business: in the year 2002 a study showed that out of 3.5 million sent messages 81 sales were generated in the first week of the campaign, resulting in an income of USD 1,500[48]. These numbers can be confirmed with more recent information disclosed by a former spammer: sending 40 million mails can render a weekly income of USD 37,440[58].

Spam is also one of the reasons why there is malicious software at all. Next to Distributed Denial of Service attacks (DDoS), botnets are used to send spam mails[50]. About 10 million zombie computers organized in botnets are actively sending out spam and email-based malicious software. As the zombies are added and removed dynamically to prevent static blacklist solutions from blocking the zombies[9], it can be assumed that there are many more computers controlled by the bots. A single zombie of a Storm botnet sends an average of 1.04 spam mails per second, up to 136,000 mails per day[15].

As long as there are enough people buying products advertised by spam or following the hyperlinks in phishing mails there will be spam. It is unlikely that this social problem can be solved by software. Of course modern web browsers can help to protect users against fraud like phishing, but in the end only better education will prevent people from being defrauded by spam and phishing. Likewise, modern software cannot protect end-users who use outdated software and do not care or are unable to update to a more recent and secure version.

This implies that there has to be some effort to reduce the number of received junk mails and to lower the risk of being defrauded by phishing. Therefore it is required that unsolicited mails are recognized in a reliable way which does not require manual control.

Current spam fighting teniques like Bayesian filters or Uniform Resource Identifier Blalists (URIBL), whi are discussed in Chapter 2.1, are commonly reactive. ey require a large set of received spam messages to extract features su as URIs referenced in a message. With the help of the extracted features the algorithms can distinguish spam from ham messages (valid messages). But this reactive approa has disadvantages because it must first receive the spam messages. As long as new features are not extracted, the teniques cannot identify messages as junk. is is an annoyance for users as the teniques produce false negative results and the users have to delete the unrecognized spam manually. Spam fighting has to become proactive: preventing that spam messages can be delivered to the end-users’ mailboxes at all or at least provide spam recognition solutions, whi are able to remove messages, based on new spam paerns, at the same time as the new paern is used for the first time.

1.2 Proactive Spam Fighting Techniques

In this thesis the implementation of two proactive spam fighting techniques is discussed. These techniques aim to prevent spam mails from being delivered to users at all and to recognize new spam faster and in a more reliable way.

The first technique, called Mail-Shake, is a concept which prevents spam or at least makes it more difficult for spammers to send spam. To this end each sender has to authenticate once as a human. Mails sent from unauthenticated senders are dropped automatically, so spammers are unable to deliver their junk. This concept is discussed in more detail in Chapter 2.2.

The second technique helps to identify received spam mails in a more reliable way. By intercepting mails sent by a bot, generic templates are generated and used to identify spam mails even if other techniques are not yet able to recognize the email as spam. The construction and usage of Spam Templates is discussed in more detail in Chapter 2.3.

The hope is that these techniques help reduce the number of spam mails received by users and the time which is required to check for and sort false positives and negatives. The Mail-Shake concept is immune against false positives as only mails sent by computers are classified as spam mails. The Spam Templates on the other hand will not mark mails sent by humans as spam because the template is constructed in a way that it only matches mails sent by a bot.

These techniques can and should be used in combination with other existing spam fighting techniques such as greylisting, blacklists and Bayesian filtering systems, which are discussed briefly in Chapter 2.1.

1.3 Notes About the Implementation

e two teniques, Mail-Shake and Spam templates, are developed independently but using the same libraries and tenologies. Both applications are built upon the Personal Information Man- agement (PIM) framework developed and used by the KDE community. is framework, called Akonadi, is completely client and platform independent, whi is currently (and other Unixes), Microso Windows and Mac OS X. As the underlying KDE and Qt libraries are being ported to more platforms su as smart phones, Akonadi will probably become available on those as well. Although Akonadi has been developed for the usage in KDE’s PIM suite “” it was designed with client independence in mind. So there are already different KDE applications available, whi use Akonadi, and some prototype applications developed in different programming languages and with different GUI libraries. e Akonadi framework is discussed in Chapter 3.3. e combination of platform and client independence has the advantage that the applications developed in the scope of this thesis can be used with different email clients. Nevertheless the applications are developed in a way so that its code can easily be reused by other projects to provide a more native integration. erefore an abstraction layer is implemented and used.

1.4 Structure of This Thesis

In the current Chapter a short introduction and motivation for implementing proactive spam fighting techniques was presented. The applications whose implementation is discussed in this thesis were named and a short introduction to the framework used to develop the applications was provided.

The following Chapter 2 discusses the proactive spam fighting techniques. First of all related work, in this case other existing but reactive spam fighting techniques, is presented. This motivates the discussion of the two techniques: Mail-Shake and Spam Templates.

Before the implementation can be discussed, an overview of the background of the system is provided in Chapter 3. This includes an evaluation of current CAPTCHA¹ techniques in Chapter 3.1, required for implementing Mail-Shake. In Chapter 3.2 an example of an automated solution to break a CAPTCHA system is presented as an excursus, which motivates the chosen solution not to implement a custom CAPTCHA, but to rely on existing and tested functionality. Last but not least a closer look at the KDE personal information management framework Akonadi in Chapter 3.3 completes the chapter on the background of the system.

The discussion of the development of the system is encapsulated in Chapter 4. First of all the software requirements (Chapter 4.1) are presented, followed by the design (Chapter 4.2), the actual implementation of Mail-Shake in Chapter 4.3 and of Spam Templates in Chapter 4.4.

¹Completely Automated Public Turing test to tell Computers and Humans Apart

The following Chapter 5 evaluates the results. This shows if the concepts presented in this thesis are able to reduce the number of received spam mails and if the concepts are usable at all. The implementation allows easy reuse in different client implementations. Some possibilities for future work and a retrospection are presented in Chapter 6. Last but not least a conclusion on the results of this thesis is presented in the last Chapter 7.

2 Proactive Spam Fighting

In this Chapter the two concepts Mail-Shake and Spam Templates are discussed. Both concepts are proactive spam fighting techniques and are able to eliminate spam messages before they are shown to the end-user. This is an important difference to the existing, but reactive techniques. Some of those techniques are also presented in this Chapter.

2.1 Related Work

In this Section other existing spam fighting techniques are presented. Most of those techniques are reactive and share the disadvantages of reactive approaches. A brief overview of technologies like Bayesian filtering, blacklists and greylisting is provided and their advantages and disadvantages are discussed.

2.1.1 Bayesian Filtering

e most common spam fighting teniques are the Bayesian and rule-based filtering systems as used for example by Spam Assassin¹. ese are examples for reactive spam fighting solutions: a large repository of both spam and ham messages is required to extract features from all mes- sages. ese extracted features can be used to distinguish spam from ham messages via a Bayesian model[55]. Rule-based filtering systems are reactive as well. For constructing a rule it is required to first look on the spam messages to construct the rule. Using rules for spam filtering is rather limited as the logical rule set makes binary decisions whether to classify a given mail as spam[55]. is can easily result in false positives, as seen in January 2010 when the dates grossly in the future became present for Spam Assassin[29]. e misbehaving rule tests for dates in the year 2010 or later and ea message receives an additional score between 2.075 and 3.554. As Spam Assassin classifies a message as spam at a score of 5.0 this rule causes many false positives. ese limitations of rule-based filtering systems can be circumvent by feature extraction and the use of Bayesian filtering systems. Nevertheless a Bayesian system is not the perfect solution as well. For example it can only extract features from text messages and is unable to filter image

¹http://spamassassin.apache.org/ 6 2 Proactive Spam Fighting based spam. e number of image based spam increased significantly in 2006[8] and the images are distorted by applying teniques used for CAPTCHAs, so that computers are unable to restore the original image[67].
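The reactive nature of Bayesian filtering can be illustrated with a minimal sketch of a naive Bayes classifier with add-one smoothing. It is not the model used by SpamAssassin or by [55]; the class name `NaiveBayesFilter`, the whitespace tokenizer and the training messages are assumptions made only for this illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: features are lower-cased whitespace-separated words.
    return text.lower().split()


class NaiveBayesFilter:
    """Hypothetical minimal naive Bayes spam filter (illustration only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Reactive step: the filter learns only from mails that were already received.
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def log_score(self, text, label):
        # Requires at least one training message per class, otherwise log(0) fails.
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        score = math.log(self.message_counts[label] / total_messages)
        for word in tokenize(text):
            count = self.word_counts[label][word]
            # Add-one smoothing keeps unseen words from producing log(0).
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        return score

    def is_spam(self, text):
        return self.log_score(text, "spam") > self.log_score(text, "ham")


spam_filter = NaiveBayesFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("meeting notes for monday", "ham")
spam_filter.train("lunch on monday", "ham")
```

A message made up only of words the filter has never seen scores close to the class priors, which mirrors the weakness described above: new spam patterns pass until enough examples have been collected.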

2.1.2 DNS Blacklists

One of the most common teniques to blo spam mails directly on the mail server is the use of a DNS blalist (DNSBL). e name refers to the fact that the blalist is queried with the help of the Domain Name System (DNS). To test if a given IP address a.b.c.d is enlisted in a certain blalist the mail server just has to query for the A record for the address d.c.b.a.blalist-name[30]. If the query is successful the mail should be rejected as the sender’s IP address is known to send spam. DNSBLs are of course a reactive spam fighting approa. A given IP address has to be verified to be used for spamming. e important question is if the IP addresses of bots get listed while the bot is actively sending out spam messages. A study from 2005 shows that DNSBLs are not capable to blo spam sent by botnets. Out of 4,295 IP addresses, whi were known to be part of the Bobax botnet, only 225 were blalisted in the DNSBL provided by Spamhaus²[52]. On the other hand a blalist might easily blo legitimate senders. For example if the IP address of a bot is assigned dynamically by its Internet Service Provider (ISP), the ISP might have assigned the same IP address to a different customer at the time the DNSBL includes this address. So the actual bot is not bloed, but a legitimate user is bloed. An empirical study showed, that 80 % of the IP addresses of possible spammers in February 2004 were still listed in at least one of seven popular DNSBLs two month later. Some of the IP addresses were already present in the DNSBLs in the year 2000[30]. e fact that a DNSBL can blo any domain from sending mails is also a great disadvantage as the DNSBL can be abused. In 2007 the popular Spamhaus project demanded that the Austrian Network Information Center “nic.at“ takes down addresses used for phishing. As the registrar did not react, Spamhaus started to “blamail” the registrar by enlisting its domain, so that nic.at could not send mails anymore[47, 51]. 
DNSBLs seem not to be an appropriate method for spam fighting any more. e reactive approa is unable to scope with the frequently anging IP addresses of spam sending bots and the ances that legitimate senders are bloed is too high. Especially the incident between Spamhaus and nic.at illustrate that the disadvantages of DNSBLs prevail.
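The lookup described in this Section, reversing the octets of a.b.c.d and querying the A record of d.c.b.a.blacklist-name, can be sketched as follows. The zone name `blacklist.example` and the function names are placeholders chosen for this illustration; the resolver is passed in as a parameter so that the logic can be tried without network access.

```python
import socket


def dnsbl_query_name(ip_address, blacklist_zone):
    # a.b.c.d becomes d.c.b.a.<zone>, the name whose A record is queried.
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address")
    return ".".join(reversed(octets)) + "." + blacklist_zone


def is_listed(ip_address, blacklist_zone, resolve=socket.gethostbyname):
    # A successful lookup means the address is listed. A failed lookup is
    # treated as "not listed" here, which is a simplification: a resolver
    # outage would then let every sender pass.
    try:
        resolve(dnsbl_query_name(ip_address, blacklist_zone))
    except socket.gaierror:
        return False
    return True


# Builds the query name for a documentation address and a placeholder zone.
print(dnsbl_query_name("192.0.2.99", "blacklist.example"))
```

Because only one DNS query per connection is needed, the check can run before the mail body is transferred, which is why it is typically placed directly on the mail server.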

²http://www.spamhaus.org/

2.1.3 URI Blacklist

A different form of blalists are the Uniform Resource Identifier Blalists (URIBL). Instead of bla- listing the IP address of senders, domain names referenced in mail bodies more oen than a given threshold are included in the blalist. A given mail is analyzed if it contains an URI to su a blalisted domain name and in that case the mail is classified as spam[33]. In opposite to the DNSBLs the complete mail has to be received and the content has to be analyzed. e approa is reactive and requires a large set of both spam and ham messages as the presence of an URL in the message body is not a reliable indicator for spam. Almost 90 % of legitimate mails contain URLs as well[33]. An advantage of URIBLs compared to other teniques is, that it only analyzes the URLs in the message body. On the other hand the approa easily produces false negative results as it requires the presence of an URL. If a spam message does not contain an URL, as it is for example image spam, the message cannot be classified as spam. Given the fact that URIBLs produce false negative results, it cannot be used as an own spam fighting solution, but has to be combined with other teniques. So the fact that a mail has been classified as spam by using an URIBL should only be seen as an indicator for spam.

2.1.4 Greylisting

Greylisting is a combination of a bla- and a whitelist with automatic whitelist management. Ea new received mail is initially rejected on the Mail Transfer Agent (MTA) and the unique triplet of IP address of sending host, sender address and recipient address in the envelope is stored. If the sending host tries to deliver the mail again aer a defined delay, the mail is accepted and the triplet is moved to the whitelist. All further communication from this triplet will not be delayed[26]. Greylisting is based on the assumption that spam sending applications are using a “fire-and- forget” approa. If a spam message cannot be delivered the application does not try to resend the message, although temporary failures are always possible. e first testing of greylisting in mid-2003 showed an effectiveness of 95 %[26]. Unfortunately this effectiveness is based on the fact that spam sending applications do not implement SMTP correctly. By adopting the spam sending applications to circumvent the protection provided by greylisting, the success rate can be decreased. In Chapter 5.3 on page 92 an evaluation of the current effectiveness of greylisting is provided. e greylisting approa has some disadvantages. First of all ea legitimate mail is delayed if a sender tries to deliver a mail for the first time. In case that several MTAs are used to send outgoing mails, it is possible that ea mail is delayed as the IP address stored in the unique triplet anges for ea mail. It is even possible that legitimate mails are bloed completely because the hosts do not retry to send the mail or handle the temporary failure as a permanent and return the mail to the end-user[37]. 8 2 Proactive Spam Fighting

Even when greylisting breaks because spammers adapt their applications, it is useful to continue to use greylisting. Basically greylisting binds resources on the spammer’s side. The spammer has to use a mail queue and cannot continue to use a fire-and-forget approach. Due to the fact that the host’s IP address is part of the unique triplet, the same bot has to send the message after the delay. There is the chance that by this time reactive approaches such as DNSBLs include the bot’s IP address in their blacklists and so the spam message can be blocked, although the greylist is overcome.
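The triplet handling of greylisting can be sketched as follows. The class name, the return values and the default delay of 300 seconds are assumptions for this illustration; `"tempfail"` stands for answering the delivery attempt with a temporary SMTP failure so that a standards-compliant sender retries later.

```python
class Greylist:
    """Hypothetical in-memory greylist keyed by (IP, sender, recipient)."""

    def __init__(self, min_delay_seconds=300):
        self.min_delay_seconds = min_delay_seconds  # assumed delay, not from the text
        self.first_seen = {}    # triplet -> time of the first delivery attempt
        self.whitelist = set()  # triplets that already passed the delay

    def check(self, client_ip, sender, recipient, now):
        triplet = (client_ip, sender, recipient)
        if triplet in self.whitelist:
            return "accept"  # known triplet, no further delay
        first_attempt = self.first_seen.get(triplet)
        if first_attempt is None:
            self.first_seen[triplet] = now
            return "tempfail"  # first contact is always rejected temporarily
        if now - first_attempt < self.min_delay_seconds:
            return "tempfail"  # retry came too early
        del self.first_seen[triplet]
        self.whitelist.add(triplet)
        return "accept"


greylist = Greylist()
print(greylist.check("192.0.2.1", "a@example.org", "b@example.net", now=0))
print(greylist.check("192.0.2.1", "a@example.org", "b@example.net", now=600))
```

Since the IP address is part of the key, a retry from a different outgoing MTA starts a new delay, which corresponds to the disadvantage described above for senders with several MTAs.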

2.1.5 Conclusion

As this Section illustrated, none of the presented existing techniques is able to reliably distinguish spam from ham messages. Most of the existing techniques are reactive and require that first a large set of false negative results is generated. Based on these false negative results the techniques can be improved to identify spam messages in the future. But this is of course an annoyance for the end-user, as unfiltered messages appear in the mailbox and the user has to filter those manually.

A more proactive spam fighting approach is required. Spam messages have to be identified before the messages hit the end-user’s mailbox. Greylisting is a step in the right direction as it blocks spammers, but it is only a solution until it is commonly used, as the spammers will adjust their software.

2.2 The Mail-Shake Concept

In this Section the Mail-Shake concept as described in [19] is discussed. First of all the idea is presented, followed by a discussion of how and why the concept works. Finally some of its limitations are named together with ways to circumvent them.

2.2.1 Proactive Spam Fighting With Dynamic Whitelists

The basic idea behind Mail-Shake is to block all mails from unauthenticated senders and to provide senders an easy way to authenticate themselves. The process of authentication is done in a way that humans are able to participate, while computers - and by that spam bots - are not. After authentication the sender’s address is put on a whitelist. This whitelist is used by Mail-Shake to decide if a mail is authorized or not. The concept is therefore proactive, as it blocks spam before it is read by the user.

[Figure: message sequence between User A and User B. (1) User A (private address) sends the initial email to User B (public address); the recipient is placed on the whitelist. (2) User B (public address) replies with a challenge and a random ID. (3) User A (private address) resends the initial email with the ID in the subject to User B (private address); the whitelist entry is updated and the address is placed on the whitelist. (4) Future communication runs between User A (private address) and User B (private address).]

Figure 2.1: Overview of the Mail-Shake email process [19]

These initial steps of authentication are illustrated in Figure 2.1. A sender (User A) has to send a mail to User B’s public email address. All mails sent to the public address are discarded, but answered with a challenge mail containing a unique identifier. User A has to solve the challenge, which reveals User B’s private address. Now User A can resend the original mail with the identifier in the subject. Mail-Shake compares the identifier and puts User A’s address on a whitelist. In the future User A can send mails directly to User B’s private address. The authentication step is required only once, and there is no need to include the identifier in each single mail. Other mails sent to the private address are discarded if the sender address is not on the whitelist.

The challenge, which reveals the private email address, has to be solvable by a human but not by a computer; that is, it is a kind of CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). An example of such a challenge mail is presented in Figure 2.2. The actual challenge is implemented by relying on the reCAPTCHA web service, which offers the possibility to protect an email address with a CAPTCHA.

A Mail-Shake user can publish the public address openly on the web. If a spam bot obtains this address, all spam mails sent to the public address remain unread, and the spam bot is unable to obtain the private address from the Mail-Shake challenge mails.

Subject: Mail-Shake challenge

You sent an email to a public Mail-Shake address. The email will not be delivered. You have to send the email to the private address. You can retrieve the private address by visiting the following web address and solving the shown CAPTCHA:

http://mailhide.recaptcha.net/d?k=01CnVIbRzbs1dYsDFRJi_3RQ==&c=6pdRWDUzBNLbqFDUM-P8vMAb9FJMDoP3HqWDsEQZPoI=

The email to the private address will only be delivered if you include the following text in the subject of your email:

Mail-Shake Id: 37735

In future you can send emails directly to the private email address as normal. If you did not send an email to the public address you can ignore this email.

Figure 2.2: Example of a Mail-Shake challenge mail
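The handshake described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation developed in this thesis: the class and method names are hypothetical, the CAPTCHA that reveals the private address is not modelled, and the five-digit identifier merely mirrors the format shown in Figure 2.2.

```python
import secrets


class MailShakeAccount:
    """Toy model of one Mail-Shake protected account (hypothetical names)."""

    def __init__(self, public_address, private_address):
        self.public_address = public_address
        self.private_address = private_address
        self.pending = {}       # sender address -> identifier sent in the challenge
        self.whitelist = set()  # authenticated sender addresses

    def receive(self, recipient, sender, subject):
        """Return the action taken for one incoming mail."""
        sender = sender.lower()
        if recipient == self.public_address:
            # Mail to the public address is never delivered, only answered
            # with a challenge mail that carries a fresh identifier.
            self.pending[sender] = f"{secrets.randbelow(100000):05d}"
            return "challenge"
        if sender in self.whitelist:
            return "deliver"
        identifier = self.pending.get(sender)
        if identifier is not None and f"Mail-Shake Id: {identifier}" in subject:
            # The sender received the challenge and knows the private address.
            del self.pending[sender]
            self.whitelist.add(sender)
            return "deliver"
        return "drop"
```

In this sketch a mail to the private address is delivered only if the sender is already whitelisted or quotes the identifier that was issued to exactly this sender address. A bot with a forged sender address never sees the challenge mail and therefore cannot quote the identifier.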

If a spam bot obtains the private email address, the bot is still unable to send junk to the Mail-Shake user. The addresses used in the spam mails are not on the whitelist and the mails are dropped. A spam bot is therefore unable to send spam mails without going through the authentication steps. As a spam bot might obtain both the public and the private address, the unique identifier is introduced. Without the identifier it would be sufficient to send a mail to the public address followed by a mail to the private address. The unique identifier ensures that a Mail-Shake challenge mail has to be received. Spam bots using forged sender addresses are therefore unable to receive the required unique identifier.

The concept discussed so far is unusable when two Mail-Shake users try to communicate with each other. User A tries to send a mail to User B’s public address. As with all outgoing communication, User A sends the mail from his private address. User B’s Mail-Shake setup therefore sends the challenge mail to a private address. As there has not been any communication before, the challenge mail is dropped. To circumvent this problem, a temporary whitelist is introduced. Whenever a Mail-Shake user sends a mail, the recipient address is added to the temporary whitelist. A response mail is therefore not blocked and the challenge mail can be received. The temporary whitelist also guarantees that all communication started by a Mail-Shake user is possible.

Mail-Shake makes it more difficult for spammers to deliver their junk. It requires one CAPTCHA to be solved for each recipient address. That increases the cost of sending spam as long as the CAPTCHA is secure and has to be solved by a human. Even if a spammer knows both the private and the public address, he needs to be able to receive the challenge mail to get on the whitelist. The user will most likely delete the entry from the whitelist, so for each spam mail to be sent the spammer has to send two mails and receive one. Nevertheless there is one possible way to circumvent the Mail-Shake protection: infecting a system with malware and sending spam to all addresses gathered on this system, disguised as the valid sender. In that case a Mail-Shake user will remove the address, and Mail-Shake should inform the sender, as this is a strong indicator that the sender’s system is infected.

2.2.2 Limitations of the Mail-Shake Concept

2.2.2.1 Leaking of Private Address

To assume that the private address will not be leaked is rather naive. If User A has a Mail-Shake protected setup and User B is authorized to send mails to User A’s private address, the address leaks if User B sends one mail to both User A and a third User C. If User C then replies to both User B and User A, the mail to User A is dropped, as illustrated in Figure 2.3(a).

Another scenario for leakage is that the Mail-Shake User A sends a mail to User B, who is not authorized. Mail-Shake automatically adds User B’s address to the temporary whitelist, so that response mails are accepted. In this scenario User B does not even know that User A is using Mail-Shake. By forwarding User A’s mail to User C the private address is leaked. If User C tries to send a mail to User A, the mail is dropped. This scenario is illustrated in Figure 2.3(b).

[Figure: two message diagrams between User A, User B and User C.]
(a) 1. User B sends one mail to both the private address of User A and to User C. 2. User C sends a mail to User A, which is dropped as his address is not on the whitelist.
(b) 1. User A sends a mail to User B and adds B’s address to his temporary whitelist. 2. User B forwards the mail to User C. 3. User C sends a mail to User A, which is dropped as his address is not on the whitelist.

Figure 2.3: Leakage of private Mail-Shake address

Given these two scenarios it is unacceptable to drop mails without any notice to either receiver or sender, as proposed in the Mail-Shake concept paper. While this is an annoyance for private mail communication, it might have legal consequences in corporate or governmental institutions. A financial loss is also possible if, for example, orders are sent to a Mail-Shake protected private address.

Of course a notice to the user of Mail-Shake (User A) does not make sense, as it requires him to manually check all filtered mails, and in that case Mail-Shake does not provide any advantage over existing solutions. Therefore the sender (User B) has to be notified that the mail has been dropped. This notification must offer a way to send mails to User A. Publishing User A’s public address in the notification is not an option, as then both the public and the private address are known to User B. He would only need to generate a unique identifier by sending a mail to the public address in order to send mails to the private address in the future. This implies that he would not have to authenticate himself as a human. Therefore the notification must include the public address as a challenge. The workflow for authentication in this scenario is illustrated in Figure 2.4. The challenge reveals User A’s public address, and User B has to send a mail to the public address in order to generate an identifier. This identifier has to be sent with the original mail to the already known private address. This workflow requires one more mail to be sent and makes the complete process more complex.

[Figure: sequence diagram between Sender, Receiver (private address, public address) and Storage.]
1. Sends mail to private address
2. Drops not whitelisted mail
3. Sends notification mail with challenge for public address
4. Solves challenge, receives public address
5. Sends mail to public address
6. Generates Id and adds it to storage
7. Sends regular challenge mail
8. Resends original mail with Id in subject
9. Updates the whitelist

Figure 2.4: Mail-Shake authentication initiated on private address

Another approach to solve this limitation is to include the unique identifier in the notification and to add the sender’s address to the whitelist when a mail with the correct Id is sent to the public address. This approach has not been pursued, as mails to the public address are not delivered to the end user. While it allows an address to be added to the whitelist, the sender nevertheless has to resend the original mail to the private address. There is also the chance that a user who knows the Mail-Shake concept expects the public address to be the private one.

2.2.2.2 Communication with Automated Systems

There is a usability limitation in conjunction with all kinds of communication with automated systems, such as mailing lists, bulletin boards and online shops. The idea behind Mail-Shake is to accept mails only from senders who authenticate themselves as humans. In the case of such automated systems we want to receive mails although the sender is not a human. The authentication procedure does not work for newsletters, for example. In most cases an automated system does not accept mails at all, and even if it does, there is nobody to solve the challenge and change the address. The public address is therefore out of bounds for communication with automated systems. This implies that whenever mails from an automated system are expected (e.g. when creating a new account in an online shop) the private address has to be specified. At this point each mail sent by the automated system is dropped, as the whitelist does not yet contain an entry for the address. It is the user’s obligation to add the address manually. Unfortunately the address used by the system is in general unknown to the user.

A possible solution to this problem is to ignore it. If mails that are not whitelisted are moved to a different mail folder instead of being deleted, the user can check this folder for the mail. Of course this is a very suboptimal solution, as the user has to look through a folder full of junk mails. It is also no solution if mails are automatically deleted, as the original approach suggested.

In [19] a proposed solution to this limitation is to allow wildcards in whitelist entries. That is, if the user created an account at the online shop “Foo”, he can manually create an entry that matches “*foo*”. Of course this does not only match mails sent by the shop, but also junk mail disguised as mail sent from this shop. So the user either has to change the entry as soon as the first mail from this shop has been received, or specify a more specific rule like “*@foo.de”. But this would not match, for example, mails sent from addresses like “[email protected]”. Adding more wildcards to the domain part of the address does not solve this problem, as it opens the door to spammers using domains like “foo.bar.de”, which would be matched by an entry like “*@*foo*”.

The implementation, which is discussed later in this thesis, supports these two possibilities to circumvent the limitation. It also supports a third one: instead of dropping the mail directly, a processing delay is introduced. The user receives a notification in the user interface that a mail will be dropped, and he can add the sender address to the whitelist before the mail is dropped. To provide better usability the notification can be toggled on and off. So before a user registers at a web shop he can activate the notifications, wait until the first mail of this automated system is received and turn off the notifications. Waiting for the first received mail allows exactly the domain used by the automated system to be whitelisted, so that the entry should not match junk mails using parts of the domain name in their sender address.

Unfortunately the assumption that all mails from one web shop use the domain of the first mail does not hold, as the evaluation (see Chapter 5.2 on page 90) showed. If an online shop Foo is a subsidiary of a company Baz, the registration mail might be sent from the domain “foo.de” while further communication is sent from “baz.de”. In such a case a false positive is generated. On the other hand this proves that the Mail-Shake concept works correctly.
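The trade-off between broad and narrow wildcard entries can be tried out with Python’s standard `fnmatch` module. This is only a sketch: the helper name and all example addresses are hypothetical, and the wildcard syntax of the actual implementation may differ from shell-style patterns.

```python
from fnmatch import fnmatchcase


def entry_matches(entry, address):
    """Case-insensitive shell-style match of a whitelist entry against an address."""
    return fnmatchcase(address.lower(), entry.lower())


results = {
    # Broad entry: matches the shop, but also junk that merely contains "foo".
    "broad_shop": entry_matches("*foo*", "newsletter@foo.de"),
    "broad_junk": entry_matches("*foo*", "foo-offers@junk.example"),
    # Narrow entry: junk is rejected, but so is a subdomain of the shop.
    "narrow_shop": entry_matches("*@foo.de", "newsletter@foo.de"),
    "narrow_subdomain": entry_matches("*@foo.de", "newsletter@mail.foo.de"),
    # Wildcards in the domain part accept the subdomain and foreign domains alike.
    "domain_subdomain": entry_matches("*@*foo*", "newsletter@mail.foo.de"),
    "domain_junk": entry_matches("*@*foo*", "junk@foo.bar.de"),
}
```

Only `narrow_subdomain` evaluates to false here, which reflects the dilemma described above: every entry that is loose enough to cover all of the shop’s sender addresses also covers addresses a spammer can register.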

2.2.2.3 Sending Notifications in Reply to Spam

A limitation not considered at all in the Mail-Shake paper [19] is that Mail-Shake challenges or notifications are sent in reply to spam mails. If the sender address is valid but forged, a challenge is sent to a user who did not request it. If that user is using Mail-Shake himself, the challenge is dropped without bothering him. If he is not using Mail-Shake, he receives one or, in worse cases, many unwanted notifications. These backscattered mails are of course unwanted and can even be considered spam. The consequences could be that rule based spam fighting techniques are trained to filter out Mail-Shake challenges or that the MTA³ sending the challenges is put on a blacklist. This would mean that a user of Mail-Shake is either unable to send mails, or that users who want to send him mails are unable to go through the Mail-Shake authentication because the challenge mails are dropped automatically.

If the sender address of a spam mail is forged and not valid, the situation is slightly different. Mail-Shake tries to send a challenge to this address, but this cannot succeed as the address does not exist. The MTA responds with a delivery status notification mail informing the end user that the mail delivery failed. This notification is sent to the sender address of the Mail-Shake challenge, which is the public address. Of course Mail-Shake generates another challenge, sent to the address of the MTA. If the MTA does not accept mails, another delivery status notification is sent to the public address. At that point Mail-Shake is caught in a mail sending loop, as illustrated in Figure 2.5: each delivery status notification sent by the MTA causes another challenge mail, and each challenge mail causes another delivery status notification.

[Figure: a spam bot sends a spam email with a non-valid sender address to the Mail-Shake user; the Mail-Shake user’s challenge email cannot be delivered; the MTA sends a delivery status notification to the Mail-Shake user, which triggers the next challenge email.]

Figure 2.5: Mail loop triggered by a spam mail with an invalid sender address

If notifications are sent in reply to mails received at the private address, the situation becomes more complex. Of course a mail loop can be triggered as well. A further problem is that delivery status notifications in general must not be dropped automatically, as they might be valid mails in case a mail sent by the user could not be delivered. Therefore Mail-Shake must be able to distinguish delivery status notifications sent in reply to a Mail-Shake mail from those sent in reply to a user mail.

With RFC 3464 [44] a specification for the format of delivery status notifications (DSN) exists. Unfortunately not all MTAs implement this specification, although the first version (RFC 1894) was published in 1996. During the evaluation (compare Chapter 5) non-compliant delivery status notifications sent from Exim, qmail and the MTA of Google Mail have been received. While the notifications sent by the first one offer a minimal chance to be recognized as a notification, the latter ones do not. The notifications are normal plain text mails with the original, undelivered mail pasted into the text body. A compliant DSN uses a special MIME (Multipurpose Internet Mail Extensions) type⁴ and provides the undelivered mail as an attachment. Appendix A contains examples of both compliant and non-compliant delivery status notifications received during the evaluation.

Mail-Shake has to be able to recognize a DSN and must not send challenges or notifications in reply to a DSN. Furthermore, at the private address Mail-Shake must drop only those DSNs that are sent in reply to Mail-Shake notifications. The only way to recognize whether a DSN is a reply to a Mail-Shake challenge is the attached undelivered mail, which is specified as optional in RFC 3464. While this is in general positive, as it blocks backscattered spam, for Mail-Shake it is a problem. Fortunately the evaluation showed that all standard compliant DSNs attach either the complete mail or at least the header, which is sufficient to recognize a Mail-Shake mail. In the case of non-compliant notifications such as the one sent by Exim there is only the choice to either drop all notifications or to allow all notifications; in other words, the choice between false positives and false negatives.

³Mail Transfer Agent
⁴multipart/report

[Figure: 1. A user of the web service sends a message via the web service. 2. The web service relays the message as mail to the Mail-Shake user. 3. Mail-Shake discards the message.]

Figure 2.6: Web service as relay of a mail

The handling of delivery status notifications as proposed in this Section weakens the Mail-Shake concept. It is possible to successfully send mails to the private address without having to authenticate: a spammer would only have to disguise the spam as a DSN. In the case of a standard compliant delivery status notification there is at least the chance that such mails can be recognized with extensions to the email client.
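How a standard compliant DSN could be recognized and attributed to a Mail-Shake mail is sketched below with Python’s standard `email` package. This is a minimal sketch under assumptions, not the implementation of this thesis: the function names are hypothetical, and a Mail-Shake mail is recognized here only by the subject line shown in Figure 2.2, whereas a real implementation would need a more robust marker.

```python
from email import message_from_string

CHALLENGE_SUBJECT = "Mail-Shake challenge"  # assumption, taken from Figure 2.2


def is_compliant_dsn(msg):
    """True for the multipart/report layout described in RFC 3464."""
    report_type = msg.get_param("report-type") or ""
    return (msg.get_content_type() == "multipart/report"
            and report_type.lower() == "delivery-status")


def returned_messages(dsn):
    """Yield the returned original mail, or its headers, if attached."""
    for part in dsn.walk():
        content_type = part.get_content_type()
        if content_type == "message/rfc822":
            yield part.get_payload(0)
        elif content_type == "text/rfc822-headers":
            yield message_from_string(part.get_payload())


def classify(raw_mail):
    """Classify a raw mail so that only DSNs for Mail-Shake mails are dropped."""
    msg = message_from_string(raw_mail)
    if not is_compliant_dsn(msg):
        return "no compliant DSN"
    for original in returned_messages(msg):
        if original.get("Subject", "").startswith(CHALLENGE_SUBJECT):
            return "DSN for a Mail-Shake mail"
    return "DSN for a user mail"
```

A DSN classified as belonging to a Mail-Shake mail can be dropped silently, which also breaks the mail loop of Figure 2.5. A non-compliant notification ends up in the first category, where only the all-or-nothing choice described above remains.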

2.2.2.4 Web Services

Another limitation in the usability of Mail-Shake, which can be considered a variant of the communication with automated systems, was found during the evaluation: Mail-Shake is unable to handle mails sent from web services such as social networking services. Consider the case that User A is using Mail-Shake and User B is an authorized sender who knows the private address of User A. User B is also a user of the social network Foo, while User A is not. User B wants to invite User A to join that network. He therefore gives User A’s private address to Foo, and Foo sends an invitation mail to User A. This mail is of course dropped, as it uses an address of Foo that is not whitelisted and not the whitelisted address of User B. The web service is, so to speak, a mail relay which changes the sender address, as illustrated in Figure 2.6. If User B specifies User A’s public address instead, it fails as well: the challenge is sent to Foo, and as Foo is an automated system it cannot solve the challenge. In fact the evaluation showed that a bounce mail might be sent stating that mails cannot be sent to that address. As this mail triggers another challenge mail, Mail-Shake and Foo are stuck in a mail sending loop similar to the one seen above in the case of DSNs.

In fact Mail-Shake behaves exactly the way the user expects it to work. Although the private address leaked, it is useless to the social network. The user’s privacy is still protected by Mail-Shake. Inviting someone to a social network or a similar web service should be done through existing communication channels and not by giving private information, such as email addresses, to a third party. Mail-Shake thereby not only prevents spam but also protects the user’s privacy.

This limitation does not only occur for social networks, but in all cases where a web service forwards a mail or sends an invitation to the service. For example, invitation mails for services like “Google Wave” are blocked as well. To solve this limitation Mail-Shake could ship predefined whitelist entries to allow mails sent from such known services. Of course there must be a way to update this predefined whitelist in case new services are established or addresses change.

Purchases via the popular auction platform eBay fail as well, as the seller sends a mail to the buyer. The seller’s address is of course unknown to the buyer and therefore not whitelisted. A possible workaround is to watch for mails at the time the purchase finishes, or to use only the web frontend provided by the platform.

A similar problem occurs for Review Board⁵, a web-based code reviewing tool used for example by the KDE community. The web tool knows the addresses of all participants, and if User A opens a review request to User B, a mail is sent to User B from User A’s address. The header section does not contain any information which could be used to identify the mail as having been sent from Review Board. Appendix B provides an example of such a header section and one from a system with useful headers. As Review Board is open source software, the easiest solution is to propose a patch that includes a special header in each mail.

2.2.3 Summary

In this Section the Mail-Shake concept has been presented. Mail-Shake protects an email account by using whitelists. All mails with a sender address which is not on the whitelist are blocked. To get an address on the whitelist a sender has to prove that he is a human and not an automated system by solving a CAPTCHA. For the authentication process each user of Mail-Shake has two addresses: a public and a private one. Each mail to the public address is answered with a mail containing the CAPTCHA, which reveals the private address. The concept is of course not bulletproof, and some limitations of the concept and possible solutions to them were presented. The most severe problems are communication with automated systems and the handling of Delivery Status Notifications. The solutions to these limitations are discussed in more detail in the scope of this thesis.

⁵http://www.reviewboard.org/

2.3 The Spam Templates Concept

In this Section the concept of proactive spam filtering based on templates, as described in [25], is presented. The general idea is to generate templates by intercepting mails sent by spam bots. These templates are used to identify newly received mails as junk by matching the mail against the templates.

2.3.1 Template Based Spam Mails

Nowadays spam mails are mostly sent by botnets which control a large number of systems infected with malicious software (malware). These controlled systems, commonly known as bots or zombies, communicate with a control server to get the order to send out spam. Most large spam sending botnets, like the Storm Worm botnet and its successor the Waledac botnet [60], use a special technique to generate and send spam messages [59], as illustrated in Figure 2.7. The control server passes templates, which describe the structure of the spam messages to be sent, and meta-data such as recipient lists to the bots. The templates contain variable parts, which the bots fill in when sending out the messages, for example with URLs that are also received from the control server [34].

Figure 2.7: Template based spamming [25]

By intercepting the communication between the bots and the mail servers they connect to for sending the spam messages, the templates can be reverse engineered. To intercept the communication, samples of malware are executed in a sandbox, a controlled environment. The malware, which is running on a native Microsoft Windows machine, is allowed to communicate with its control server to receive current templates and the list of target recipients. When the bot tries to start an SMTP⁶ connection, the connection is intercepted and redirected to a local mail server. The local mail server is the man-in-the-middle between the bot and the mail server the bot wanted to connect to. To trick the bot into believing that it is communicating with the actual mail server, the local one has to connect to the “real” MTA, grab the banner and relay it to the bot.

As the original template is passed from the control server to the bots, it seems to be a more elegant solution to intercept this communication instead of intercepting the SMTP communication (and reverse engineering the template). But obtaining the original template might not be useful. Each botnet uses its own template language, which can be, as the Storm botnet illustrates, a fairly elaborate template language with support for formatting macros, generation of random numbers, dates, etc. [34] These languages have to be reverse engineered, and the analysis has to be adjusted each time the botnet slightly changes the language, which renders the idea of spam fighting based on spam templates reactive.

Another reason to reverse engineer the templates is that not all botnets distribute their templates to the bots. There are also botnets using a reverse proxy-based spamming technique [49]. The bot connects to the control server and establishes a reverse SOCKS proxy connection. The control server uses this tunnel to send out the spam messages directly without passing the template to the bot.

By intercepting all SMTP communication only current spam messages are gathered. This has the advantage that when a new spam campaign is started the mails are already present. Existing techniques which rely on feature extraction first have to gather many spam mails to be able to identify a new campaign, resulting in a high false negative rate at the start of a new spam campaign.

The idea of the Spam templates concept is to generate the templates used by the bots. For this purpose a bot is executed for a certain amount of time or until a certain number of messages has been collected. Afterwards the system is reset and a different malware sample is executed to receive messages sent by another botnet.

⁶Simple Mail Transfer Protocol

2.3.2 Generation of Templates

Aer one bot has been executed, templates can be reverse engineered from the collected spam mails. e messages are sorted, so that the longest message is processed first. e longest mail becomes the base template and by merging it with the other mails a template is generated. If the merge with one mail results in a too generic template, the new one is discarded and the mail is moved ba to the list of unprocessed mails. at guarantees that the template does not become too generic and only mails whi were generated from the original template end up in the template. As soon as all mails are processed or only mails are le whi render the template too generic, the template generation process ends[25]. An example for a generated template is provided in Figure 2.8. Subject, X-Mailer header and ea line of the message body are replaced by a regular expression. A disadvantage of this template generation process is, that the template can be too specific if 20 2 Proactive Spam Fighting

Subject\:\ ([\!\-\.\’\s\w]){7,137}\ X\-Mailer\:\ Microsoft\ Outlook\ Express\ 6\.00\.2720\.3000\ Body\:\ \#([\=\.\-\&\;\!\’\s\w]){20,152}\!\!\>\>\=09\ \.([A-Za-z]){14,14}Next\ Body\ Part\:\ \<\!DOCTYPE\ HTML\ PUBLIC\ \"\-\/\/W3C\/\/DTD\ HTML\ 4\.0\ Transitional\/\/EN\"\>\ \\\ \

Figure 2.8: Example of a generated Spam template the mails sent by one bot are always the same or only very few different mails are sent. So if the generated template is used to test mails, there can easily be a too high false negative rate as the mails vary to the intercepted.

2.3.3 Proactive Filtering

In the scope of this thesis a proactive spam filtering system based on the generated Spam templates is implemented. The system has to test new incoming mails against all templates; if a mail matches one of the templates, it can be classified as spam and discarded. The system can fetch new templates from the servers generating the templates, which makes the approach proactive. Even before spam mails of a new campaign have been received, the template might already have been generated, so mails of the new campaign are instantly recognized. In the worst case the templates become available after a mail has been received. But even in this case newly fetched templates can be used to test already received but unread mails. Current spam fighting techniques, which test mails when they arrive at the mail server, do not test already received mails when the rule set is updated. The Spam templates are used on the client side, so that it is possible to test mails once on receipt and again when new templates become available.

Generating the templates is not part of this thesis. A system for generating Spam templates was already developed at the Laboratory for Dependable Distributed Systems at the University of Mannheim. Unfortunately there is no production environment which generates current templates, so the implementation can only be used to test existing spam mails. In the scope of this thesis an algorithm is developed which classifies a given mail as spam based on the generated Spam templates, and a system to provide and fetch new templates is elaborated. The algorithm is discussed in more detail in Chapter 4.4.
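The idea of testing mails on receipt and again whenever new templates arrive can be sketched as follows. This is not the algorithm of Chapter 4.4; it is a simplified illustration that assumes a template consists of one regular expression per header and per body line, as suggested by Figure 2.8. All class and method names are hypothetical.

```python
import re


class SpamTemplate:
    """One template: a regular expression per header and per body line."""

    def __init__(self, headers, body_lines):
        self.headers = {name: re.compile(rx) for name, rx in headers.items()}
        self.body_lines = [re.compile(rx) for rx in body_lines]

    def matches(self, headers, body):
        for name, regex in self.headers.items():
            if regex.fullmatch(headers.get(name, "")) is None:
                return False
        lines = body.splitlines()
        if len(lines) != len(self.body_lines):
            return False
        return all(rx.fullmatch(line) for rx, line in zip(self.body_lines, lines))


class TemplateFilter:
    """Client-side filter that re-tests unread mails when templates arrive."""

    def __init__(self):
        self.templates = []
        self.unread = []  # (headers, body) pairs that passed the last test

    def _is_spam(self, mail):
        return any(template.matches(*mail) for template in self.templates)

    def receive(self, headers, body):
        mail = (headers, body)
        if self._is_spam(mail):
            return "spam"
        self.unread.append(mail)
        return "inbox"

    def add_templates(self, templates):
        """Add newly fetched templates and return unread mails they now catch."""
        self.templates.extend(templates)
        caught = [mail for mail in self.unread if self._is_spam(mail)]
        self.unread = [mail for mail in self.unread if not self._is_spam(mail)]
        return caught
```

A mail of a new campaign that arrives before its template is first placed in the inbox; once `add_templates` is called with the matching template, the mail is returned as caught, provided it has not been read in the meantime.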

2.3.4 Summary

In this Section a short introduction to the ideas of spam filtering based on Spam templates has been presented. Current spam messages are generated from common templates, and by reverse engineering these templates it is possible to classify a mail as spam in a reliable way. In the scope of this thesis the classification system is implemented, so that an email client can discard mails as junk after downloading new templates.


3 Background

In this Chapter the essential background for implementing the two proactive concepts is presented. First an evaluation of current CAPTCHA techniques is provided, as a secure CAPTCHA is required in Mail-Shake. An excursus on breaking an existing CAPTCHA system illustrates that Mail-Shake has to use an existing and well tested system. Furthermore the framework Akonadi, on which the two systems are built, is presented in this Chapter.

3.1 Evaluation of Current CAPTCHA Techniques

3.1.1 Introduction

One of the most important parts of the Mail-Shake approach is the challenge, which reveals the private email address to the sender. The challenge is only solvable by humans and thereby prevents spam bots from sending spam mails to the receiver using Mail-Shake. This challenge could be some form of CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), a riddle, or simple text [19].

The most commonly used approach to implement this challenge is the CAPTCHA technique. CAPTCHAs provide automatically generated tests which cannot be passed by current computer programs using the current state of Artificial Intelligence (AI) [66]. The fact that a CAPTCHA does not guarantee that future computer programs will be unable to break it has an implication for the implementation: it has to be possible to easily change the CAPTCHA implementation in use. In this Section several currently used CAPTCHA implementations are evaluated.

It is important to remember that CAPTCHAs are publicly available. Kerckhoffs’ principle is therefore valid for CAPTCHA techniques as well: the strength of the CAPTCHA has to be determined by the difficulty of solving the AI problem and not by “security through obscurity”. As Mail-Shake will be published as Open Source, an attacker will have access to the implementation details.

Many websites prevent spam comments with CAPTCHAs which do not require a hard AI problem to be solved. An example is a CAPTCHA which is just a simple mathematical equation, as provided by the “ConfirmEdit” extension for MediaWiki¹. These CAPTCHAs can be solved by computer programs. The strength of those CAPTCHAs is either the fact that there are enough websites which do not use CAPTCHAs or the fact that the CAPTCHA is placed at a random position in the HTML source, which makes it more difficult to parse. The CAPTCHA itself does not provide any additional security. For the implementation of Mail-Shake it is important to use only CAPTCHAs which require a hard AI problem to be solved. If the CAPTCHA does not add any security, there is no need to use a CAPTCHA at all and the idea of separating humans from computers is broken [65].

It is an important requirement for Mail-Shake to use an existing and well-tested CAPTCHA implementation instead of a custom one. In the scope of this thesis it cannot be proven that a custom implementation is based on a hard AI problem, so a custom implementation has to be considered insecure and unsuitable for Mail-Shake.

¹http://www.mediawiki.org/wiki/Extension:ConfirmEdit

3.1.2 Simple Obfuscation

Many people and websites use a simple obfuscation to protect email addresses. This could be used by Mail-Shake as well. To obfuscate the email address, special characters are replaced by a textual representation. That way the email address is still readable, but a bot trying to harvest the address is unable to find it, because it is no longer a valid address and cannot be matched by a regular expression for email addresses as defined in RFC 822. For example the address

firstname.lastname@example.tld

could be obfuscated as

firstname [dot] lastname [at] example [dot] tld

The obfuscation does not fulfill the requirements of a strong CAPTCHA. It is based on a simple translation rule which replaces special characters with a textual representation, and this translation can of course be reversed. The technique seems to be useful to prevent harvesters from extracting mail addresses from arbitrary websites[21]. If the position of the obfuscated email address is well known, however, the obfuscation does not add any security. For example the web archive of GNU Mailman² just replaces the “@” character by a textual representation, while the email address can always be found at the same position: the first hyperlink of the web page. This address can be extracted automatically, as both the position in the document and the translation rule are known. Appendix D provides the source code of a small application with only 270 lines of code³ which is able to extract all email addresses from a Mailman archive. The result of extracting the addresses of one month of a public mailing list⁴ is shown in Figure D.1.

²http://www.gnu.org/software/mailman/index.html
³Source Lines of Code counted with David A. Wheeler’s ’SLOCCount’
⁴Mailing list plasma-devel@.org

Figure 3.1: Example of a reCAPTCHA

For Mail-Shake the same problem occurs: an attacker would know the position as well as the rules to retrieve the unobfuscated email address. Obfuscating the email address is therefore not a sufficient protection and cannot be used as a challenge in Mail-Shake.
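How little effort reversing such a translation rule takes can be illustrated with a minimal sketch. It is not the application from Appendix D; the rule list and the function name `deobfuscate` are assumptions made for this illustration only, and a real harvester would simply add every variant it has observed.

```python
import re

# Hypothetical translation rules (assumed for illustration): the textual
# representation on the left is mapped back to the special character.
RULES = [
    (re.compile(r"\s*\[\s*at\s*\]\s*", re.IGNORECASE), "@"),
    (re.compile(r"\s*\[\s*dot\s*\]\s*", re.IGNORECASE), "."),
]


def deobfuscate(text: str) -> str:
    """Reverse the textual replacements to recover the plain address."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


print(deobfuscate("firstname [dot] lastname [at] example [dot] tld"))
# prints: firstname.lastname@example.tld
```

Once the rule set is known, the recovered string matches an ordinary regular expression for email addresses again, so the obfuscation only helps against harvesters that do not know the rules.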

3.1.3 Image Based CAPTCHAs

The first CAPTCHAs in use were image based CAPTCHAs. A word or any sequence of letters and digits is distorted and drawn above a noisy background. The distortion allows humans to pass the test while most computer programs, including OCR programs, are unable to solve it. Figure 3.1 illustrates an example of the reCAPTCHA system, which was developed at Carnegie Mellon University and has been a subsidiary of Google Inc. since September 2009[2].

Although image based CAPTCHAs pose a hard AI problem, it is possible to break them by using image processing algorithms for object recognition. With that approach CAPTCHAs generated by the EZ-Gimpy implementation, for example, can be broken 92 % of the time[45]. Chapter 3.2 provides an excursus on breaking a current image based CAPTCHA system.

Image based CAPTCHAs have the disadvantage that they cannot be solved by people with impaired vision. An image cannot be presented by a Braille display, so a user who depends on a Braille display cannot solve the challenge. If screen reader software were able to present the CAPTCHA to a blind user, the CAPTCHA would not serve its purpose because bots would be able to figure it out as well[28]. This is a severe limitation of image based CAPTCHAs and because of that they cannot be used wherever accessibility has to be guaranteed, such as in governmental institutions.

For Mail-Shake the email address would have to be obfuscated by using an image as shown in Figure 3.2. This reveals some important information to a possible attacker. An email address has a well known structure containing exactly one “@” and at least one dot. The attacker can generate as many CAPTCHAs as needed and knows that each CAPTCHA contains the same obfuscated address. Given these points it does not seem to be a good idea to use an image based CAPTCHA to obfuscate the email address directly.

Figure 3.2: CAPTCHA containing email address “[email protected]”

3.1.4 Audio Based CAPTCHAs

In an approach similar to image based CAPTCHAs, aural CAPTCHAs can be generated. A sequence of random digits and characters is generated and the spoken tokens are recorded on an audio clip together with some added noise. Unfortunately the amount of noise that has to be added to the clip to prevent Automatic Speech Recognition applications from breaking the CAPTCHA renders the aural CAPTCHA hardly usable[56]. Current aural CAPTCHAs can be broken by software with a reliability of 58 %[62].

From an accessibility point of view an aural CAPTCHA does not solve the problems shown for image based CAPTCHAs. Blind people are able to solve the CAPTCHA but deaf people are not. Although the audio CAPTCHA can be solved by blind people it is still very difficult for them. Screen readers often speak over the playback of the audio CAPTCHA. This is one of the reasons why blind people achieve a success rate of only 43 % in solving audio CAPTCHAs[4].

For Mail-Shake the email address would have to be recorded and distorted. If the listener does not understand the address correctly the Mail-Shake process fails. Given this disadvantage an aural CAPTCHA does not seem to be a solution for Mail-Shake.

3.1.5 Image Recognition CAPTCHAs

A different approach to image based CAPTCHAs are image recognition CAPTCHAs. Instead of presenting an image with distorted text, several images are presented and the user has to identify the common object or identify an anomaly. The approach has the same limitations for users with impaired vision as the image based CAPTCHAs. A proposed implementation uses an English pictorial dictionary, so every word is easy to illustrate. For each of the words the first 20 hits from Google’s image search are used to build up the database[6].

A disadvantage for the use in Mail-Shake is that it does not provide a way to conceal the email address. It can only be used for the unique identifier. In that case it suffers from spelling mistakes and different languages. While an English user would identify and answer with “dog”, a German user would answer “Hund”. Requiring a language would render the challenge unsolvable for people not speaking that language, and the Mail-Shake implementation cannot distinguish a result submitted in a different language from an incorrect result.

Another disadvantage is the limited set of words in the dictionary. An attacker could generate a database of all possible images if he has access to the dictionary, which has to be assumed because CAPTCHAs are required to be public. With the help of such a one-time task an attacker would be able to solve the challenge.

Figure 3.3: Example of an Asirra CAPTCHA

There exists an approach to overcome this problem by providing only images of dogs and cats. This concept is called “Asirra”⁵ and an example is provided in Figure 3.3. The user’s task is to identify all cats in a set of 12 images. The database consists of images provided by Petfinder.com, the world’s largest web site devoted to finding homes for homeless animals. It contains over three million categorized images of cats and dogs and nearly 10,000 new images are added each day[18]. Unfortunately the proposed implementation is attackable: there exists a classifier which is 82.7 % accurate in telling apart the images of dogs and cats[24]. For Mail-Shake the Asirra approach does not seem suitable as the central database cannot be shared and the implementation is web service centric.

To solve the accessibility problems of audio and image based CAPTCHAs a system could be used which combines those techniques. The image recognition could be combined with a characteristic sound of the same object[28]. For example an image could show a dog while the audio clip contains barking. The disadvantage of this approach is the limited pool of combinations: the system can be broken by generating and solving all possible challenges manually. The paper discussing this approach is aware of this problem and suggests that websites lock out IP addresses trying to solve too many CAPTCHAs in a certain amount of time[28]. For Mail-Shake this does not work as the attacker can use different email addresses to generate challenge emails. This CAPTCHA system therefore does not fulfill the requirements implied by the public availability of the test.

3.1.6 Riddle

A riddle is not a commonly used CAPTCHA technique, but the paper discussing the Mail-Shake approach[19] names it as a possible way to encode the challenge. It therefore has to be discussed in this section as well. “Riddle” is a very generic term and there are different types of riddles which could be used to tell computers and humans apart. In general a riddle can be considered a hard AI problem and fulfills the requirements for a CAPTCHA.

⁵Animal Species Image Recognition for Restricting Access

There are basically two categories of riddles which have to be discussed in this section. The first is a riddle which does not encode the email address, but has a distinct solution such as a digit or a word. In the scope of Mail-Shake such a riddle could be used to encode the unique identifier instead of the email address. The second kind of riddle reveals the email address, so it has to be a kind of instruction for an algorithm to reveal the address.

The first kind of riddle could be a semantic CAPTCHA system. The human has to show that he understands the semantics of the presented words. For example the names of three animals are presented: two of them are birds, one is a mammal. The user has to recognize which of the three given animals differs from the other two[38]. Of course such a system has to be combined with image based CAPTCHAs, as presenting just the words in plain form renders the riddle useless: a bot guessing at random already succeeds with a probability of one third.

The semantic CAPTCHA approach has some obvious disadvantages. First of all it does not add any security beyond that of the image based CAPTCHAs. As soon as the words are recognized by an OCR system, solving the semantic CAPTCHA is just a matter of probability. The pool of possible riddles is obviously limited. Each time an attacker breaks one CAPTCHA, the attacker therefore gains information on three words: two words are of the same category while the third does not belong to that category. Whenever a riddle is presented which contains two of these three words, the probability of solving the riddle is increased to one half instead of one third. In that way an attacker is able to build up a semantic database and solve each new challenge. By using existing semantic databases like the Dublin Core Metadata[68] in combination with existing reasoners it is possible to solve the riddle by using queries[17].

The strength of the system does not rely on the strength of the AI problem but on keeping the semantic database secret. It therefore does not fulfill the requirements of Kerckhoffs’ principle and has to be considered broken. This CAPTCHA system is not only easy to break but also very difficult for certain groups of people. For example it becomes very difficult for someone who does not know one of the presented animals. This is very likely for non-native speakers, children or illiterate people. Altogether the system is difficult to solve for humans but easy to solve for computers and is not more secure than image based CAPTCHAs. Other riddles which are based on a finite number of predefined riddles can be broken with a similar approach and have the same accessibility problems.

The second category of possible riddles is an instruction to retrieve the email address from the given data. An example of such an instruction could look like the following:

Use the third leer of this mail followed by the first leer of the seventh word. e next leer is the first leer of the next word. Aer this use the third leer of the second word of this sentence. e last leer is the first leer of the second last word.

Using this instruction the word “email” is retrieved. Obviously the reader needs the skill to read and understand the instruction and has to be able to count. As the instruction is not obfuscated it can be presented by a Braille display and is therefore solvable by blind people. Nevertheless from an accessibility point of view the instruction is not a good solution, as it becomes very difficult to solve this riddle for non-native speakers as well as for illiterate people.

The algorithm is attackable as the set of construction rules is limited. If there were an algorithm which could use an infinite number of instructions, there would be a program which is able to understand human language. If there were such a program it would be possible to implement a program which is able to understand the instruction and to solve it. If the set of construction rules is limited, it has to be expected that the attacker knows all rules and is therefore able to write a lexical scanner and a parser which solve the riddle encoded in the instruction.

This shows that even a riddle completely embedded into the challenge email is solvable by a computer program, as the rules to construct the riddle are limited and, given Kerckhoffs’ principle, the complete rule set and the algorithm are known to an attacker. The strength of riddles does not depend on randomness but on keeping the algorithms and the data used secret, and for that reason riddles are unusable as CAPTCHAs.
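The argument that a limited rule set can be scanned and parsed can be made concrete with a small sketch. It assumes a hypothetical generator that knows a single construction rule of the form “<ordinal> letter of the <ordinal> word”; the rule format, the ordinal table and the function name `solve` are assumptions for illustration and do not cover the relative references (“the next word”, “this sentence”) used in the example above.

```python
import re

# Ordinals the hypothetical riddle generator is assumed to use.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
            "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10}

# The single assumed construction rule: "<ordinal> letter of the <ordinal> word".
RULE = re.compile(r"(\w+) letter of the (\w+) word", re.IGNORECASE)


def solve(instruction: str, text: str) -> str:
    """Apply every rule found in the instruction to the words of the text."""
    words = re.findall(r"[A-Za-z]+", text)
    letters = []
    for letter_ordinal, word_ordinal in RULE.findall(instruction):
        word = words[ORDINALS[word_ordinal.lower()] - 1]
        letters.append(word[ORDINALS[letter_ordinal.lower()] - 1])
    return "".join(letters).lower()


mail = "Use the third letter of this mail"
riddle = "Take the third letter of the first word, then the first letter of the seventh word."
print(solve(riddle, mail))  # prints: em
```

Each further rule the generator supports costs the attacker one more pattern, which is a one-time effort as long as the rule set is finite and public.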

3.1.7 reCAPTCHA

reCAPTCHA is a web service developed by the Carnegie Mellon University and provided by Google Inc. Instead of just generating a random image based CAPTCHA, words from scanned books and articles are used. The CAPTCHA is made up of two words which could not be recognized by two different OCR applications and which are additionally distorted. This process is illustrated in Figure 3.4. It can be assumed that the CAPTCHA is more secure than a random one, as current OCR software is unable to break it. One of the two words is known and is used as a control word. If the user enters the control word correctly, it is assumed that the other, suspicious word is correct as well. If a suspicious word has been recognized in the same way by the first three users, it becomes a known word and will be used as a control word in future. This approach helps to digitize books. The reCAPTCHA system is used by more than 40,000 websites and in the first year of usage 440 million suspicious words could be deciphered[1].

Figure 3.4: The words to be used for reCAPTCHA⁶

reCAPTCHA offers a service (called “Mail-Hide”) which protects email addresses by requiring the user to solve one of their CAPTCHAs to retrieve the email address. It thus fulfills exactly the requirement of Mail-Shake: after solving the challenge the address is revealed. A Mail-Shake challenge email could include a link to the web page containing the reCAPTCHA. This has the advantage that the system is known to be secure, it is accessible even for blind users as it provides an additional aural CAPTCHA and it does not generate additional traffic by including an image or audio clip in the challenge mails. The disadvantage is that Mail-Shake depends on an external provider and that the user has to register the email address at the Mail-Hide service. Of course an implementation can provide help in this task.

Although reCAPTCHA seems to be more secure than comparable image based CAPTCHAs, due to the fact that two state-of-the-art OCR applications failed at recognizing the used words, there is a flaw in the design. It is known that reCAPTCHA only uses English words and that it is sufficient to solve only one of the two words. This knowledge can be used to attack reCAPTCHA and such an attack on the CAPTCHAs used in early 2008 resulted in a success rate of 17.5 %[69].

⁶Reference: [1]

3.1.8 Conclusion

The evaluation of existing CAPTCHA technologies shows that many techniques like simple obfuscation or riddles are not secure. In particular, developing a custom CAPTCHA system is out of scope as it is very likely that the implementation can be broken. So for Mail-Shake an existing implementation has to be used.

Unfortunately strong CAPTCHA implementations have severe accessibility problems: image based CAPTCHAs cannot be solved by blind people and audio based CAPTCHAs cannot be solved by deaf people. It appears to be impossible to provide a CAPTCHA for Mail-Shake which is both strong and accessible. The possible solutions discussed in the literature do not fulfill other requirements of Mail-Shake. One of these additional requirements is to keep the amount of data to be included in each challenge mail as small as possible. As Mail-Shake is used in combination with email clients, the challenge mails are sent by the client and it cannot be taken for granted that a broadband connection is available.

Sending allenge mails with for example aaed images and audio files might break the traffic limit and render the Internet connection unusable for the user. is is a possible injection vector for spam bots: generating allenge mails to cause a denial of service. To prevent su a (distributed) denial of service aa the allenge mails should be text only if possible. For Mail-Shake it is not sufficient to use a limited pool of CAPTCHAs as this does not fulfill the requirement of being public. If the pool of CAPTCHAs is limited and the same pool is used in ea installation an aaer just has to solve ea CAPTCHA once. ere are companies providing the service of breaking CAPTCHAs and promising to deliver 250,000 solved CAPTCHAs per day at a cost of USD 2 for 1000 solved CAPTCHAs[16]. is renders breaking a limited pool of CAPTCHAs just a small one time investment. Most of the requirements are fulfilled by the reCAPTCHA Mail-Hide service. It is well tested and provides the required security. Including reCAPTCHA in a allenge mail just requires to include an URL and by that does not generate additional traffic. e reCAPTCHA system provides both a visual and an aural CAPTCHA. So people with impaired visibility are able to solve the allenge. Unfortunately it does not provide a solution for deaf-blind people, but as discussed it is difficult to provide a CAPTCHA system whi is accessible to everyone and secure at the same time. As it is sufficient to include an URL in the allenge mail, relying on reCAPTCHA is not a sin- gle vendor lo-in. If there is another service providing web-based CAPTCHAs to protect email addresses this service could be used as well. ere already exists another service called ”scr.im”⁷ and another candidate for providing su a service is the Asirra CAPTCHA. So Mail-Shake uses the reCAPTCHA Mail-Hide system in a way that it is possible to swit to a different web-based service. Of course using URLs inside mails has a disadvantage. 
In general it should be discouraged to cli unknown hyperlinks in unknown mails. A Mail-Shake allenge mail can be considered as su an unknown mail and by that an educated user will probably not follow the instructions in order to protect himself from phishing aas. As well spammers could disguise phishing mails as Mail-Shake allenges trying to direct uneducated users onto their fraud web sites. Both problems can be circumvent by integrating solving the Mail-Shake allenge directly in the email client as shown in Chapter 4.3.3.
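The cost argument against a limited CAPTCHA pool made earlier in this section can be quantified with a short calculation. The price and the throughput are the figures cited from [16]; the pool size of 10,000 CAPTCHAs is a purely hypothetical value chosen for illustration, as are the function names.

```python
import math


def cost_to_solve(pool_size: int, usd_per_thousand: float = 2.0) -> float:
    """Cost in USD of having every CAPTCHA of the pool solved once."""
    return pool_size / 1000 * usd_per_thousand


def days_to_solve(pool_size: int, solved_per_day: int = 250_000) -> int:
    """Whole days the service needs at the advertised throughput."""
    return math.ceil(pool_size / solved_per_day)


pool = 10_000  # hypothetical pool size, not a figure from the thesis
print(cost_to_solve(pool))  # prints: 20.0
print(days_to_solve(pool))  # prints: 1
```

Under these assumptions the complete pool is solved for USD 20 within a single day, and the result can be reused against every installation sharing that pool.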

⁷hp://scr.im/ 32 3 Baground

3.2 Excursus: Breaking a CAPTCHA System

The strength of a CAPTCHA system is an important factor for the Mail-Shake concept. If the CAPTCHA is broken the complete concept is weakened as the email address is no longer protected. In this excursus the attempt to break an existing CAPTCHA system is discussed. The attack on the “scr.im” CAPTCHAs shows that this system is completely insecure and that the CAPTCHAs can be solved in an automated way with a success rate of 95 %. This confirms the assumption that it is in general a bad idea to create your own CAPTCHAs as it is very likely that an untested system is insecure.

The scr.im CAPTCHA system was selected for the attack as it is one of the two CAPTCHA systems which are supported in the Mail-Shake implementation discussed later in this thesis. For better usability the user is guided to solve the CAPTCHA without the need to visit the web page. To this end the CAPTCHA has to be downloaded and the user input has to be sent to the server to verify the result. This is discussed in more detail in Chapter 4.3.3. The implementation revealed some flaws in the design of this specific CAPTCHA system, which motivated the attack on the system.

3.2.1 The Scr.im CAPTCHA System

The purpose of the scr.im CAPTCHA system is to protect email addresses in a way that bots are unable to harvest the addresses from web pages. It is an alternative to the Mail-Hide system which was discussed in Section 3.1.7. Instead of displaying an email address a link to a scr.im CAPTCHA is provided. A user who wants to reveal the email address has to solve the CAPTCHA by choosing the text matching an image CAPTCHA. The system provides a selection of nine possible matches. An example of this CAPTCHA system is illustrated in Figure 3.5. When the user selects the text matching the image, the protected email address is revealed and displayed to the user. The system is advertised as a safe way to protect email addresses so that an address cannot be harvested.

3.2.2 Flaws in the Design of the Scr.im CAPTCHA System

The scr.im CAPTCHA system has several design flaws which render the system completely insecure. That means that it is possible to solve the provided CAPTCHAs in an automated way so that no incorrect results are generated. The automated process is able to determine the correct result with a probability of 95 % and does not submit the remaining results at all. The scr.im system is therefore unable to recognize an attack as there are no false results.

An obvious flaw is to present a selection of possible results. This improves the usability (compared to all the CAPTCHAs evaluated in Section 3.1 it is the most usable system), but it increases the probability of solving the CAPTCHA by pure chance from 1/36^5 (based on five characters from an alphabet of 26 uppercase characters and ten digits) to 1/9. In fact this flaw renders the system completely useless, as a CAPTCHA which can be solved by pure chance offers no protection against automated processes.

Figure 3.5: The scr.im CAPTCHA system

Unfortunately the selection of possible matches is not only a way to solve the CAPTCHA by pure chance, but also a way to verify that only correct results are submitted to the system. If an attacker can use optical character recognition (OCR) software to extract the characters from the image, the attacker can verify the result of the OCR by comparing it to each of the proposed solutions. The software written to attack this system shows that if at least three characters match those of one of the provided solutions, the correct result is always found.

The CAPTCHA is generated each time the user reloads the web page. The image itself is also generated when it is downloaded, but the same character string is reused. That means it is possible to generate several images for the same CAPTCHA, as illustrated in Figure 3.6, without regenerating the list of possible solutions. An attacker can use this to try another image for a better OCR result. It remains unclear why the image is not stored on the server but generated each time, as reloading the web page generates a completely new CAPTCHA with a new selection of possible solutions anyway.

Figure 3.6: Different images for the same scr.im CAPTCHA

The CAPTCHA does not follow the recommendations for creating secure CAPTCHAs[69]. While the background of the image has some noise, its color differs from the colors used for the letters. The background uses a light green while the characters are either a dark green or a dark red. By using a simple thresholding algorithm the background can easily be replaced by white pixels and the letters by black pixels. The characters do not overlap, which implies that the image can be sliced to generate one image for each character. Moreover this eliminates the rotation applied to some of the characters, as each character image can be rotated individually until the OCR software is able to produce a useful result.

3.2.3 Attack on the CAPTCHA System

Although the implemented software is able to break scr.im CAPTCHAs, it is unable to harvest email addresses stored in the scr.im system or to reveal several addresses. The software is meant as a proof of concept that the used CAPTCHAs are insecure; it can only operate on one given address and successively solves new CAPTCHAs for this address. Each solved CAPTCHA reveals the same already known email address and the software does not even present the address in the user interface, as that information is irrelevant for the purpose of breaking the CAPTCHA.

To solve the CAPTCHA in an automated way the web page has to be downloaded and the image and the possible solutions have to be extracted. The image has to be processed and passed to the OCR software. The resulting candidate has to be sent to the web page and the returned web page has to be parsed to check whether the CAPTCHA has been solved correctly.

The task of downloading the web page and submitting the result has been implemented in the scope of integrating Mail-Shake into an existing email client. The software to attack the system reuses the code written for the email client integration, and this is the main reason why the scr.im system has been chosen as the target for the proof of concept attack. The implementation of this client integration and the interaction with the web based CAPTCHA system is discussed in more detail in Chapter 4.3.3.3 beginning on page 78.

The main task of the application is to modify the image in a way that the characters can be recognized by OCR software. The modified image is then passed to the “OCRopus” command line tool, a state-of-the-art open source OCR system whose development is sponsored by Google Inc. The result is compared character by character with each of the provided possible solutions. If at least three out of the five characters match one of the possible solutions, this solution is submitted to solve the CAPTCHA. If there is no possible match, a different image is downloaded and the process is restarted. If five successively downloaded images do not yield an acceptable match, the CAPTCHA is skipped and not submitted. This check ensures that no false result is ever generated.
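The verification step described above can be sketched as follows. This is not the code from Appendix E; the function names and the callable `read_new_image`, which stands for downloading a fresh image and running it through the OCR software, are hypothetical. The sketch also assumes a strict, case-sensitive comparison and picks the candidate with the most matching positions.

```python
from typing import Callable, Optional, Sequence


def positional_matches(ocr_text: str, candidate: str) -> int:
    """Count the positions at which both strings carry the same character."""
    return sum(a == b for a, b in zip(ocr_text, candidate))


def pick_candidate(ocr_text: str, candidates: Sequence[str],
                   min_matches: int = 3) -> Optional[str]:
    """Return the best matching candidate, or None if it is too uncertain."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: positional_matches(ocr_text, c))
    if positional_matches(ocr_text, best) >= min_matches:
        return best
    return None


def solve_captcha(read_new_image: Callable[[], str], candidates: Sequence[str],
                  max_images: int = 5, min_matches: int = 3) -> Optional[str]:
    """Try up to max_images fresh images; skip the CAPTCHA if none is usable."""
    for _ in range(max_images):
        guess = pick_candidate(read_new_image(), candidates, min_matches)
        if guess is not None:
            return guess
    return None  # nothing is submitted, so no false result is produced
```

A call such as `solve_captcha(lambda: ocr(download_image()), solutions)` would tie the pieces together, where `ocr`, `download_image` and `solutions` are again placeholders for the steps described in the text.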

(a) Scr.im (b) Netbank

Figure 3.7: Comparison of original CAPTCHA image and the result of the pixel shader.

    uniform sampler2D texture;

    void main()
    {
        vec4 color = texture2D(texture, gl_TexCoord[0].st);
        // red characters are not recognized, so remove red component
        if (color.r > 0.6) {
            color.r = 0.0;
        }
        // set all bright pixels to white and dark pixels to black
        if (color.r + color.g + color.b > 1.1) {
            gl_FragColor = vec4(1.0, 1.0, 1.0, 1.0);
        } else {
            gl_FragColor = vec4(0.0, 0.0, 0.0, 1.0);
        }
    }

Listing 3.1: Pixel shader for extracting characters from the scr.im CAPTCHAs

OCRopus is unable to extract the characters from the CAPTCHA as it is provided by the system. Because of that the image has to be manipulated. The image is rendered to an off-screen OpenGL texture (Framebuffer Object) and the result is saved and passed to OCRopus. While rendering the image, it is manipulated by a pixel shader written in the OpenGL Shading Language. The source of this shader is shown in Listing 3.1; it uses only a trivial thresholding algorithm. In a first step the red component is removed from the red characters and in a second step all bright pixels are set to white and the dark pixels are set to black. This removes the noisy background so that only the characters remain visible and OCRopus is able to extract them. A comparison of the original and the modified CAPTCHA is presented in Figure 3.7, which also illustrates a second CAPTCHA used by Netbank to protect the login of their Online Banking service⁸. With only minimal adjustments to the pixel shader this CAPTCHA is solvable by OCRopus as well. Of course this does not amount to an automated attack against the Online Banking service as the login is also protected by a password.

⁸https://banking.netbank.de/wps/portal/netbank-banking-portal

The application (source code is listed in Appendix E on page 141) was used to test 100 scr.im CAPTCHAs. For 95 of them OCRopus was able to extract a string with at least three matching characters after trying at most five reloaded images. In every case the best matching string was the solution for the CAPTCHA, including the five CAPTCHAs which were not submitted. 45 CAPTCHAs could be solved by downloading only one image, 28 more with the second downloaded image. For 20 images all five characters matched, four matching characters were found for 40 images and the remaining 35 images were solved with only three matching characters. As said, five images were not submitted, but for those the best candidate, with one or two matching characters, was still the correct one.

The result is rather disastrous: 20 % of the tried CAPTCHAs were solved correctly without the help of the nine possible solutions, only by using a trivial thresholding algorithm to remove the noisy background. The OCR system has not been trained to render better results for the given set of images. The resulting text might include characters which are not used by the CAPTCHA system at all, like lower case letters or punctuation marks.
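For readers without an OpenGL setup, the two thresholding steps of the pixel shader can be mirrored on the CPU. The following sketch assumes color components normalized to the range 0.0 to 1.0, as in the shader, and uses the same two thresholds (0.6 and 1.1); the function names and the sample colors are assumptions for illustration and are not measured from real scr.im images.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]

WHITE: Pixel = (1.0, 1.0, 1.0)
BLACK: Pixel = (0.0, 0.0, 0.0)


def threshold_pixel(r: float, g: float, b: float) -> Pixel:
    """Mirror of the shader logic for a single RGB pixel."""
    # Step 1: drop a strong red component so red characters become dark.
    if r > 0.6:
        r = 0.0
    # Step 2: bright pixels become white, everything else black.
    return WHITE if r + g + b > 1.1 else BLACK


def threshold_image(pixels: List[List[Pixel]]) -> List[List[Pixel]]:
    """Apply the threshold to an image given as rows of RGB tuples."""
    return [[threshold_pixel(*p) for p in row] for row in pixels]


# Assumed sample colors: a light green background and a red character pixel.
print(threshold_pixel(0.56, 0.93, 0.56))  # prints: (1.0, 1.0, 1.0)
print(threshold_pixel(0.9, 0.1, 0.1))     # prints: (0.0, 0.0, 0.0)
```

The per-pixel function is deliberately kept free of any imaging library; combined with any image loader it yields the black-on-white input that the text describes as sufficient for the OCR software.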

3.2.4 Lessons Learned

The proof of concept attack described in this excursus illustrates several issues. First of all, trying to make a CAPTCHA more usable may result, as seen in the given example, in a completely useless system. It has to be ensured that the CAPTCHA at least cannot be solved by pure chance. This implies that the scr.im support has to be removed from the Mail-Shake implementation before it is offered to a wider audience, until the system is more secure.

A second important issue illustrated by this simple attack is the missing security of many image based CAPTCHAs. Even a CAPTCHA used to secure the login of an Online Banking service can be broken by a rather trivial attack. This confirms the conclusion from the evaluation of existing CAPTCHAs in Section 3.1: a custom CAPTCHA implementation is very likely insecure. It is better to use a secure and well tested system such as the reCAPTCHA system. Of course not every existing system provides the required security, as seen with the scr.im system.

The security of the scr.im system needs to be improved; in particular the possibility of solving the CAPTCHA by chance has to be removed. This implies that the more usable selection variant has to be replaced by the more commonly used variant in which the characters are typed into an input field. The images have to be improved as well. First of all it should be impossible to generate several images for the same CAPTCHA. The image generation has to be changed so that it is at least not possible to extract the characters by using a trivial thresholding algorithm, but other security features should be added as well.

3.3 Akonadi

In this Section the soware whi is used for implementing Mail-Shake and Spam Templates are discussed. Both parts are built upon the same soware and share as mu code as possible. e im- plementation is based upon the Akonadi framework developed and provided by the KDE developer community and using the libraries of the KDE Development Platform in version 4.4.

3.3.1 Client Plugins Compared to Central Storage

In the original Mail-Shake concept paper[19] two different implementation approaches are presented. One of these proposes to develop extensions for existing email clients such as Microsoft Outlook or Mozilla Thunderbird. Most email clients offer an API to develop extensions. With such an extension it is possible to handle the Mail-Shake concept, as the extension has access to the received mails and can send mails. Spam Templates could be integrated using an extension just like other filtering extensions.

Nevertheless such an approach has some disadvantages. For each client a custom extension has to be developed. The chances to reuse code written for one client are rather small, as for example Mozilla uses the XML User Interface Language, a custom language, to develop extensions. When several extensions are developed, the implemented behavior may differ and different Mail-Shake implementations may become incompatible. The approach is also impossible for clients which do not support extensions and for clients where the user is not allowed to install extensions, for example all web based clients running on a server.

To circumvent these problems a second approach is presented. Instead of integrating into an existing email client, an answering tool is used as illustrated in Figure 3.8. A central storage for the whitelist and unique identifiers is introduced, and the public account is completely decoupled from the email client and handled by a special answering tool. This approach still requires a plugin for the email client, but this can be smaller as it only has to handle the private address and does not have to care about the public one.

Figure 3.8: Two different applications to handle public and private addresses[19]

3.3.2 Akonadi as the Central Storage Solution

e soware solution used for the implementation in the scope of this thesis is based on the idea of the second approa, but with a subtle difference. Instead of using an extension for the mail client, the communication with the private address is moved to the answering tool as well. e complete Mail-Shake process is implemented in an own, client independent application. To communicate with mail servers the Akonadi⁹ framework is used. Akonadi is a cae for all per- sonal information management (PIM) related data, whi are emails, contacts and similar data. e framework is developed by the KDE developers community, especially by Klarälvdalens Datakon- sult AB (KDAB), to solve the problem when several applications need to access the same data. For example to access an appointment aaed to an email stored on an IMAP¹⁰ server requires that next to the calendar application an email client has to be opened. Selecting a recipient address in an email client requires to have the address book in memory although it is already loaded in the contact management application. When Akonadi is used, the applications communicate with a central server process and this server communicates with the different resources like IMAP servers, POP3¹¹ servers or vCalendar files. At the same time Akonadi provides a cae of retrieved items and metadata for these. e basic aritecture is illustrated in Figure 3.9. Using this framework for Mail-Shake and Spam Templates has various advantages. e imple- mentations become independent from an email client on the one hand without the need to im- plement protocols on the other hand. Akonadi is available for various platforms. It is a runtime dependency for the PIM suite provided by the KDE community, but does not have a dependency to the KDE Plasma workspace. Although initially implemented for the usage on Linux/Unix sys- tems, Akonadi has been ported to Microso Windows and work is going on to port Akonadi to mobile platforms[31]. 
is increases the targeted user group for Mail-Shake and Spam templates significantly as all major platforms are supported.

3.3.3 Design of Akonadi

Akonadi is not only used in applications such as email clients, it also provides the possibility to have small applications which operate independently on provided data. Use cases are for example mail filtering or storing all received mails in a search index. These applications are called Agents. There is a special kind of agents, the so-called Resources, which provide data. For example, access to an IMAP server is provided by a resource running as a separate process.

This design, which is illustrated in Figure 3.10, has several advantages. The application does not have to provide all the functionality, but can rely on external applications. Without this outsourcing the email client’s user interface might freeze while filtering mails. And if there is a crash in the IMAP implementation, the email client does not crash, but only the IMAP Resource, which can be restarted instantly.

Akonadi informs an Agent whenever a new mail is received, so that the Agent can process the mail. This perfectly fits the needs of Mail-Shake and Spam Templates. Mail-Shake has to reply to mails sent to the public address and drop not whitelisted mails sent to the private address. Spam Templates has to match a received mail against all existing templates. As Akonadi is designed to be used by email clients, it is also possible to fetch received mails. This fits the needs of Spam Templates, which wants to match received, but unread mails against newly received templates.

So Mail-Shake and Spam Templates have to be implemented as Akonadi Agents. These agents are built upon libakonadi, a client side library to communicate with the Akonadi server. As Akonadi is still under heavy development, the implementation requires the most recent release of the KDE development platform¹². The development platform is shipped with the KDE Software Compilation. The following release in summer 2010 will include the Akonadi version of the KDE email client “KMail”.

Figure 3.9: Basic aspects of the Akonadi architecture[71]

Figure 3.10: Components of Akonadi[35]

⁹http://pim.kde.org/akonadi/
¹⁰Internet Message Access Protocol
¹¹Post Office Protocol
¹²Version 4.4 released in February 2010

3.3.4 Summary

In this Section the Akonadi framework was presented. This framework is used for receiving and sending mails instead of implementing an extension for an existing email client. So the implemented solution works with all email clients, and the mail filtering is performed outside the email client, running as a separate process.

4 Development of the Systems

In this Chapter the development of the two concepts is discussed. First of all the requirements for Mail-Shake are presented. This motivates the software design leading to the actual implementation. The implementation is split into three parts: a client independent library, the Akonadi based client implementation and an integration of Mail-Shake into an email client. Last but not least the implementation of Spam Templates is discussed, which mainly concentrates on matching mails against the generated templates.

4.1 Software Requirements for Mail-Shake

In this Section the soware requirements for the Mail-Shake daemon implementation are discussed from a very high level point of view. is includes the requirements for handling mails sent to the public and private email addresses as well as the requirements for the handling of mails sent by the user of a Mail-Shake setup. e requirements are directly derived from the paper describing the Mail-Shake approa[19]. It does not yet contain any details about the later implementation. ese are discussed together with the design in Section 4.2.

4.1.1 Answering Spam Messages

All emails whi are sent by the Mail-Shake daemon may not include any data from the received mails. It is possible that spam mails use faked addresses, by that the Mail-Shake allenge mail is sent to the owner of the misused email address and not to the original sender. If the allenge mail includes for example the subject of the received mail, the spam is forwarded. at way spammers could use a Mail-Shake setup as an open relay. is misuse is illustrated in Figure 4.1. Because of this situation mails sent by the Mail-Shake daemon have to provide information about Mail-Shake as well as a notice that it is safe to ignore the mail if the receiver has not invoked the authorization progress. 42 4 Development of the Systems .

Figure 4.1: Abusing Mail-Shake to send spam (a spam bot masked as User B sends a spam email to User A, who uses Mail-Shake; User A’s setup sends the Mail-Shake challenge to User B, who does not use Mail-Shake)

4.1.2 Delivery Status Notifications

The Mail-Shake implementation must be aware of Delivery Status Notifications (DSNs). A DSN can be sent by the Mail Transfer Agent (MTA) to notify the sender about conditions in the mail delivery process such as delayed delivery or failed delivery[44]. Many spam mails use nonexistent email addresses as the sender address. So the delivery of a Mail-Shake challenge mail sent to the nonexistent address will of course fail and the MTA will return a DSN. With the requirements as described so far, the Mail-Shake daemon would answer such a DSN with another challenge mail sent to the administrator of the mail system[44]. This causes many unneeded mails to be sent to the administrator, or to a nonexistent address, causing another DSN. Therefore Mail-Shake should not respond to a DSN.

In case a DSN is sent to the private email address a special handling is required. This DSN might be caused by sending a notification in reply to a not whitelisted mail. In that case the DSN should be dropped. But the DSN could also be sent in response to an undelivered or delayed mail sent by the user. Given the decision tree as shown in Figure 4.4, the mail would be dropped without any further notice to the user, as the address used by the MTA is not on any whitelist.

To solve this problem Mail-Shake must be aware of DSNs. Whenever a DSN is received on the public email address it can be dropped without any further processing. On the private address it has to be checked if the DSN is sent in response to a mail sent by the Mail-Shake daemon or by the user. In the latter case the DSN should not be dropped.

Unfortunately RFC 3464 does not name any required fields in the DSN which could be used to identify if the mail was sent by the Mail-Shake daemon or by the user. The best existing heuristic is to use the optional third MIME part of the multipart/report, which is the original mail or part of it[44]. The existence of a Mail-Shake header can be used to determine if the DSN has to be dropped or not. In case the optional original mail is missing, the DSN is not dropped but delivered to the user.

Delivery status notifications are a possible attack vector on the Mail-Shake approach. The fact that DSNs received at the private address are delivered to the end-user can be abused by spammers. If a spam mail is disguised as a DSN, the mail is not dropped as it is not in reply to a Mail-Shake notification mail. The Mail-Shake approach does not handle such a case. The problem could be solved on the client side, as the client knows the sent mails and could check if the DSN is in reply to a sent mail. In the scope of this thesis no solution to this problem is implemented. This is a future task if spammers start to abuse delivery status notifications.
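The DSN rules of this Section can be summarized in a minimal sketch. The types Mail and Dsn, the function names and the header prefix X-Mailshake are assumptions made for this illustration; the Section only requires that "a Mail-Shake header" in the returned original mail identifies mails sent by the daemon.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified model of a mail: only the headers matter here.
struct Mail {
    std::map<std::string, std::string> headers;  // header name -> value
};

// Hypothetical model of a DSN (multipart/report). The third MIME part with
// the original mail is optional, hence the flag.
struct Dsn {
    bool hasOriginal;
    Mail original;
};

enum class Address { Public, Private };
enum class Action { Drop, Deliver };

// Assumption: every mail sent by the daemon carries a header whose name
// starts with this prefix (for example X-Mailshake-ID).
inline bool sentByDaemon(const Mail& mail) {
    const std::string prefix = "X-Mailshake";
    for (const auto& header : mail.headers) {
        if (header.first.compare(0, prefix.size(), prefix) == 0) return true;
    }
    return false;
}

inline Action handleDsn(const Dsn& dsn, Address receivedOn) {
    if (receivedOn == Address::Public) return Action::Drop;  // never answered
    if (!dsn.hasOriginal) return Action::Deliver;            // cannot decide
    return sentByDaemon(dsn.original) ? Action::Drop : Action::Deliver;
}
```

The sketch compares header names case sensitively for brevity; a real implementation would have to compare them case insensitively.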

4.1.3 Public Mail Address

The Mail-Shake implementation has to be notified about each new mail sent to the public email address. The mail itself is dropped and never read by a human. Nevertheless the mail has to be processed by the Mail-Shake daemon.

The mail is not processed if it is a reply from another Mail-Shake setup. A legitimate Mail-Shake challenge mail cannot be sent to the public email address: the user sends his outgoing mail from his private email address, so a response is sent to the private address as well and not to the public one. So it can be assumed that a Mail-Shake challenge mail sent to the public email address is a fake one and should therefore not be processed.

For sending the response mail two steps are required. First a unique identifier (Id) has to be generated and added to the mail. This should be done in a computer and human readable way. The human readable way implies that the Id is added either to the subject or the mail body. The computer readable form should be added as a separate email header, e.g. X-Mailshake-ID. That way a Mail-Shake aware client is able to recognize that the received mail is an automatically generated response and can guide the user in resending the original mail. The unique identifier has to be generated in a random or at least pseudo-random way. If the Ids were just incremented, this could be used to attack Mail-Shake.

The second step is to include the challenge into the mail. As the challenge can be of any kind, it can just be included into the mail body or as an attachment. Nevertheless, the best way to provide a challenge is to rely on the Mail-Hide API provided by the reCAPTCHA web service. This has another advantage, as the link to the reCAPTCHA can be embedded into another mail header, e.g. X-Mailshake-URL. So a Mail-Shake aware email client is not only able to recognize a Mail-Shake challenge mail, but can directly display the CAPTCHA to the user and extract the private email address after solving the challenge. By that the client is able to provide the best possible user experience.

Next to sending the challenge mail, the Mail-Shake daemon has to add the Id in combination with the sender’s email address to a storage, which is required by the private address handling to identify new authorized senders. The processing of the received mail is finished with these steps. In Figure 4.2 the whole process is illustrated by using an activity diagram. The mail can be dropped, moved to a special folder or marked as read. It might be useful to leave this decision to the user and make the handling configurable.

Figure 4.2: Activity diagram for processing mails sent to the public address (decisions: Is email a delivery status notification? Is email a Mail-Shake challenge?; actions: generate unique Id, add Id and sender to storage, send Mail-Shake challenge email, drop email)
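The two steps for answering a public mail can be sketched as follows. The alphabet, the Id length, the subject wording and the names generateId and buildChallenge are assumptions made for this illustration; only the header names X-Mailshake-ID and X-Mailshake-URL are taken from the text. Note that std::mt19937 is a pseudo-random generator and not a cryptographically secure one.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <set>
#include <string>

// Generate a pseudo-random identifier and make sure that it is not a
// duplicate of an identifier which is still in use.
inline std::string generateId(const std::set<std::string>& inUse,
                              std::size_t length = 12) {
    static const std::string alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";
    std::random_device seed;
    std::mt19937 rng(seed());
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    std::string id;
    do {
        id.clear();
        for (std::size_t i = 0; i < length; ++i) id += alphabet[pick(rng)];
    } while (inUse.count(id) > 0);
    return id;
}

// Hypothetical representation of the outgoing challenge mail.
struct Challenge {
    std::string subject;                         // human readable Id
    std::map<std::string, std::string> headers;  // computer readable Id and URL
};

// The challenge is built only from the Id and the CAPTCHA URL; no data of the
// received mail is copied, so that the daemon cannot be used as a relay.
inline Challenge buildChallenge(const std::string& id,
                                const std::string& captchaUrl) {
    Challenge challenge;
    challenge.subject = "Mail-Shake challenge [" + id + "]";
    challenge.headers["X-Mailshake-ID"] = id;
    challenge.headers["X-Mailshake-URL"] = captchaUrl;
    return challenge;
}
```

After building the challenge, the Id would be stored together with the sender's address so that the private address handling can later complete the handshake.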

4.1.4 Sending Mails

Whenever the user sends a mail the Mail-Shake daemon has to ensure that responses are not dropped when they arrive at the private email address. The user sends his outgoing mails from the private email address and not from the public one. That implies that a reply can be sent to the private email address without a previous mail to the public address. Those are of course legitimate mails and must not be dropped. Therefore a temporary whitelist is used. The recipient only has to be added to the temporary whitelist if it is not already listed in either the temporary or the permanent whitelist. In case of sending a mail containing the Id of a Mail-Shake handshake, the recipient’s address can directly be added to the permanent whitelist. The processing steps are also illustrated in Figure 4.3.

Figure 4.3: Activity diagram for sending mails (checks: receiver is on whitelist; receiver is on temporary whitelist; decision: email is a reply to a Mail-Shake challenge email? [Yes] add receiver to permanent whitelist, [No] add receiver to temporary whitelist)
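The handling of sent mails can be sketched in a few lines. The struct Whitelists and the function onMailSent are hypothetical names for this illustration, and addresses are compared as plain strings for brevity.

```cpp
#include <set>
#include <string>

// Hypothetical container for both whitelists, keyed by email address.
struct Whitelists {
    std::set<std::string> permanent;
    std::set<std::string> temporary;
};

// Called for every mail the user sends from the private address.
inline void onMailSent(Whitelists& lists, const std::string& recipient,
                       bool containsHandshakeId) {
    if (containsHandshakeId) {
        // The user answers a Mail-Shake challenge: the recipient is trusted.
        lists.permanent.insert(recipient);
        return;
    }
    if (lists.permanent.count(recipient) == 0) {
        // std::set ignores duplicates, so an existing temporary entry is kept.
        lists.temporary.insert(recipient);
    }
}
```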

4.1.5 Private Mail Address

When a mail is received on the private address the Mail-Shake daemon has to process this as well. The sender’s email address has to be checked against the whitelist. A match indicates a legitimate mail, which must not be dropped, and the processing of this mail can be terminated at this point.

If the sender is not listed on the whitelist further processing steps are required. First of all it has to be tested if the mail is a reply to a Mail-Shake challenge mail triggered by a previous mail sent to the public address. In that case the mail must contain the unique identifier. If the sender’s email client is Mail-Shake aware it adds a header containing the Id (e.g. X-Mailshake-Response-ID). The more complicated and more realistic case is that the Id is embedded into the subject. This requires that the Id can be parsed even if the sender added some additional text to the subject. It might be useful to use a structure for the unique Id which can be matched by a regular expression.

After extracting the Id it has to be matched together with the sender’s email address against the storage. If there is a match the handshake process is completed: the sender authenticated himself as a human and the email address can be added to the permanent whitelist, while it can be removed from the Id storage or the temporary whitelist. In case the Id does not match, the mail should be dropped. Dropping it silently is not acceptable in case of a simple spelling mistake, though: the sender assumes that he has been authenticated and that the mail has been received and read by the recipient. A possible solution for this problem is to send a notification that the Id is incorrect.

There are also legitimate mails triggered by sending a mail. In that case the sender’s email address is listed in the temporary whitelist. The email address can be transferred from the temporary to the permanent whitelist. If the daemon recognizes that the received mail is a Mail-Shake response containing a challenge, it is sufficient to remove the email address from the temporary whitelist without adding it to the permanent whitelist, as no further mails will be received from that address.

In Figure 4.4 the complete processing of mails sent to the private mail address is illustrated. The decision tree decides if the mail is dropped or delivered to the user. It might be better to not just drop the mail, but to send a notification about the Mail-Shake setup to offer senders unaware of the authentication process a possibility to legitimate themselves. Another issue, which has to be mentioned in the context of the requirements for handling the receipt of mails sent to the private email address, is the communication with automated mail systems, which cannot solve a challenge. A possible solution is to delay dropping the mail, so that the user has the chance to manually add the address to the whitelist.

Figure 4.4: Activity diagram for receiving mails on private address (checks: on whitelist; on temporary whitelist; Id and sender are matching; Id incorrect; not whitelisted; decisions: Is email a delivery status notification? Is DSN in reply to a notification? Is email a reply to a Mail-Shake challenge email? Is received email a Mail-Shake challenge email?; actions: drop DSN, send notification, drop email, remove sender from temporary whitelist, remove Id from storage, add sender to whitelist)
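The processing of a private mail can be sketched as follows, assuming a hypothetical Id structure of the form MS- followed by twelve upper case letters or digits, so that the Id can be found by a regular expression even if the sender added text to the subject. The names State, Verdict, extractId and onPrivateMail are assumptions for this illustration. The handling of DSNs and of the X-Mailshake-Response-ID header is omitted, and delivering a received challenge mail to the user is an assumption as well.

```cpp
#include <map>
#include <regex>
#include <set>
#include <string>

// Find an identifier such as "MS-7QK2M9ZP4TXA" anywhere in the subject.
inline std::string extractId(const std::string& subject) {
    static const std::regex pattern("MS-[A-Z0-9]{12}");
    std::smatch match;
    return std::regex_search(subject, match, pattern) ? match.str()
                                                      : std::string();
}

// NotifyIncorrectId means: drop the mail, but tell the sender about it.
enum class Verdict { Deliver, Drop, NotifyIncorrectId };

struct State {
    std::set<std::string> whitelist;
    std::set<std::string> temporaryWhitelist;
    std::map<std::string, std::string> idStorage;  // Id -> sender address
};

inline Verdict onPrivateMail(State& s, const std::string& sender,
                             const std::string& subject, bool isChallenge) {
    if (s.whitelist.count(sender) > 0) return Verdict::Deliver;

    const std::string id = extractId(subject);
    if (!id.empty()) {
        auto entry = s.idStorage.find(id);
        if (entry != s.idStorage.end() && entry->second == sender) {
            s.idStorage.erase(entry);  // handshake completed
            s.temporaryWhitelist.erase(sender);
            s.whitelist.insert(sender);
            return Verdict::Deliver;
        }
        return Verdict::NotifyIncorrectId;
    }

    if (s.temporaryWhitelist.erase(sender) > 0) {
        // A reply to a mail sent by the user. A challenge mail comes from a
        // public address which will not send further mails.
        if (!isChallenge) s.whitelist.insert(sender);
        return Verdict::Deliver;
    }
    return Verdict::Drop;
}
```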

4.1.6 Summary

In this Section the soware requirements for the Mail-Shake daemon implementation were dis- cussed. e requirements are split into the handling for mails received by the private or public ad- dress and for mails sent by the user. It was shown whi processing steps have to be implemented for the Mail-Shake concept. A special requirement for handling delivery status notifications was also discussed, so that those do neither cause additional mails sent to the mail system’s administrator nor to the user. 4.2 Design of Mail-Shake 47

4.2 Design of Mail-Shake

In this Section the soware design of Mail-Shake is discussed. e implementation consists of three parts: a client independent library, an Akonadi agent and a client integration plugin for KDE’s mail client applications. It is shown why the implementation is split in several parts and how the different parts commu- nicate and work together. erefore a closer look on the low level design, that is class structure and communication is provided as well.

4.2.1 Client Independent Library

4.2.1.1 The Need for an Independent Library

The complete functionality of Mail-Shake is split off into a separate library. This library handles the processing of mails received at the public and private address and adds whitelist entries and Ids. But it does not handle checking for newly received mails and sending mails. This has to be done by a client implementation such as the Akonadi agent.

Of course it would be possible to handle Mail-Shake completely in the Akonadi agent. But implementing a separate small library increases the portability. While Akonadi itself is completely client and platform independent (see Chapter 3.3), it is unlikely that competitors of the KDE PIM suite would provide the necessary integration in their client for handling Mail-Shake or even ship Akonadi. Even in the free and open software world the Qt and KDE dependency might be a reason to not integrate Mail-Shake into email clients based on the GUI toolkit GTK+ such as Mozilla Thunderbird or Evolution.

These political problems can be circumvented by providing a small client independent library which does not depend on Qt and KDE libraries. The Mail-Shake library is implemented with the standard C++ library and depends only on the Boost libraries¹ as an external package. Boost is a common dependency of many C++ projects[54] and therefore not a possible political issue like a dependency on Qt.

4.2.1.2 Designing the Independent Library

The library should be small and lightweight. This implies that it only provides the required functionality which should not be provided by the client implementations. As already mentioned, the library does not handle receiving or sending mails. These tasks are provided by the clients and do not have to be duplicated by the library.

¹http://www.boost.org

Figure 4.5: Classes EMail and DSN of the Mail-Shake library
EMail: fromAddress:string, replyTo:string, sender:string, toAddress:string, messageId:string, inReplyTo:string, challengeId:string, challengeURL:string, challengeResponseId:string, subject:string, headers:map; addHeader(h:string, d:string), header(name:string):list, replyMail():EMail, isMailShakeChallenge():bool, isMailShakeChallengeReply():bool
DSN (inherits from EMail): originalEmail:EMail

Implementing protocols like SMTP, POP3 and IMAP would also be a great burden. The protocols are complex and implementing each of them correctly is a task of the size of a thesis of its own. Furthermore Mail-Shake does not require complete mails: it only needs to read headers, while the body does not contain any relevant information. Nevertheless for handling DSNs a MIME implementation is required. There are of course existing MIME implementations in various clients, such as the KMime library which is part of the KDE PIM libraries.

Therefore the library just has to be an abstraction which is tailored to the needs of Mail-Shake. There needs to be a representation for a mail, but it does not have to be standard compliant, as that would require for example the implementation of MIME. It is the task of the client implementations using Mail-Shake to transfer the mail from the client representation to the representation used by the Mail-Shake library.

For Mail-Shake a DSN is just a regular mail with the addition that it should know that it is a DSN and that there might be an attached original mail. The original mail is required to distinguish if the DSN is a response to a mail sent previously by Mail-Shake or to a mail sent by the user. In case of a response to a Mail-Shake notification the DSN should be dropped; otherwise it has to be accepted without sending another notification. The DSN could be modeled by just adding an attribute to the mail class, but attached mails in general are not required. Therefore it is better to use inheritance and model the DSN as a class of its own. By that the implementation can just do an instance-of test. Figure 4.5 illustrates the inheritance of EMail and DSN as well as the class members and additional required methods. The getter and setter methods are not included for better readability.

Mail-Shake generates a unique identifier for each challenge mail. It needs to keep track of the issued identifiers and their validity. As storing the complete set of generated identifiers is done in the client side implementation, the library has to keep all currently used identifiers in local memory. This is sufficient, as for generating an identifier it only has to be tested that the new one is not a duplicate of a previously generated identifier. Altogether the library provides a wrapper around the identifier containing metadata like the address for which the identifier was generated and the creation date, so that old identifiers can be discarded. Internally all identifiers have to be stored in an appropriate data structure.

The library requires one more data structure for the whitelists. These consist of entries, and each entry contains the mail header used for matching, the filter and the way the matching should be done, that is if the filter is a regular expression or a regular string which should be matched either case sensitively or case insensitively. For example, if a mail address should be matched a case insensitive match is sufficient, but for putting all mailing lists of one domain on the whitelist a regular expression is more appropriate. In case of a temporary whitelist entry it is also required to store a timestamp, so that these entries can be discarded automatically as well. As the timestamp is useful information for the regular whitelist too, which should be editable by the user in the client side implementation, it is stored directly in the entry, and a common class for regular and temporary whitelist entries is used instead of inheritance, as illustrated in Figure 4.6.

Figure 4.6: Class WhiteListEntry of the Mail-Shake library
WhiteListEntry: date:ptime, header:string, filter:string, matchType:enum; isAddressFiltering():bool, enableAdressFiltering(bool:enable), matches(data:string):bool
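The three ways of matching a whitelist entry can be sketched as follows. This is a minimal sketch using only the standard C++ library; the enumerator names and the helper lower are assumptions, and the timestamp of the entry as well as the address filtering methods shown in Figure 4.6 are omitted.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

enum class MatchType { CaseSensitive, CaseInsensitive, RegularExpression };

struct WhiteListEntry {
    std::string header;  // mail header used for matching, e.g. "From"
    std::string filter;  // plain string or regular expression
    MatchType matchType;

    // Test the value of the header against the filter.
    bool matches(const std::string& data) const {
        switch (matchType) {
        case MatchType::CaseSensitive:
            return data == filter;
        case MatchType::CaseInsensitive:
            return lower(data) == lower(filter);
        case MatchType::RegularExpression:
            // The expression has to match the complete header value.
            return std::regex_match(data, std::regex(filter));
        }
        return false;
    }

    // Hypothetical helper: ASCII lower case copy of a string.
    static std::string lower(std::string text) {
        std::transform(text.begin(), text.end(), text.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });
        return text;
    }
};
```

An entry for a single address would use the case insensitive match, for example `WhiteListEntry{"From", "alice@example.org", MatchType::CaseInsensitive}`, while an entry such as `WhiteListEntry{"From", ".*@lists\\.example\\.org", MatchType::RegularExpression}` covers all senders of one domain.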

Figure 4.7: High Level Class Diagram of the Mail-Shake library
MailShake: idStorage:map, temporaryWhitelist:list, whitelist:list; privateMailReceived(mail:EMail):bool, testPrivateMailReceived(mail:EMail):bool, publicMailReceived(mail:EMail), privateMailSent(mail:EMail), addWhiteListEntry(entry:WhiteListEntry), addWhiteListEntries(entries:list)
Associations: one MailShake is related to 0..* Id and 0..* WhiteListEntry; the diagram also contains the classes EMail and DSN

Figure 4.8: High Level Class Diagram of Mail-Shake’s client side implementation
MailShake::Client: idAdded(id:Id), idRemoved(id:Id), mailShakeChallenge(email:EMail), notifyIncorrectAddress(email:EMail), notifyIncorrectId(email:EMail), notifyMailShake(email:EMail), temporaryWhiteListEntryAdded(entry:WhiteListEntry), temporaryWhiteListEntryRemoved(entry:WhiteListEntry), whiteListEntryAdded(entry:WhiteListEntry)
Akonadi::AgentBase::Observer: itemAdded(item:Item, collection:Collection), itemChanged(item:Item, partIdentifiers:set), itemRemoved(item:Item)
MailShakeAgent: configure(), privateItemsTimer(timer:ItemTimer), loadStorage(), loadWhiteList(), loadTemporaryWhiteList(), messageFromMail(mail:EMail):KMime::Message, mailFromMessage(msg:KMime::Message):EMail
ItemTimer: item:Akonadi::Item, timeout(timer:ItemTimer), start(msec:int), stop()
The diagram also contains the library class MailShake.

Last but not least the library requires an API entry point for the client side implementation. The client has to pass received mails for testing if they can be dropped. The list of unique identifiers and the whitelists have to be passed to the library as well. In Figure 4.7 the API entry class is shown including the methods and the relationship to the other library classes. On the other hand Mail-Shake has to send notifications, manipulate the whitelist and add new identifiers. Therefore methods in the client side implementation have to be invoked. Instead of providing an interface which has to be implemented by the client side, a signal and slot concept is used. This is discussed in more detail in Chapter 4.3.2.1 beginning on page 69.
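The idea of signals and slots can be illustrated with a minimal stand-in built on std::function. This is only a sketch of the concept: the classes Signal and MailShakeSketch and the placeholder Id are assumptions, and the mechanism actually used by the library is the one discussed in Chapter 4.3.2.1, not this one. The signal names idAdded and mailShakeChallenge follow the method names of the class diagrams.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a signal: a list of connected slots.
template <typename... Args>
class Signal {
public:
    void connect(std::function<void(Args...)> slot) {
        slots_.push_back(std::move(slot));
    }
    // Invoke every connected slot with the given arguments.
    void fire(Args... args) const {
        for (const auto& slot : slots_) slot(args...);
    }

private:
    std::vector<std::function<void(Args...)>> slots_;
};

// Hypothetical, heavily reduced library entry class.
class MailShakeSketch {
public:
    Signal<const std::string&> idAdded;             // client stores the Id
    Signal<const std::string&> mailShakeChallenge;  // client sends the mail

    void publicMailReceived(const std::string& sender) {
        const std::string id = "id-for-" + sender;  // placeholder Id
        idAdded.fire(id);
        mailShakeChallenge.fire(sender);
    }
};
```

The client connects a slot to each signal, for example `library.idAdded.connect([](const std::string& id) { /* store id */ });`, so the library never needs to know the concrete client class.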

4.2.2 Akonadi Agent

The Akonadi Agent has to implement the client side part of Mail-Shake and provide the methods which are invoked by the library whenever a notification has to be sent or a datum has to be stored durably. Therefore the agent has to implement a common interface, which in this case means providing the required slots.

Another task of the agent is to pass newly received mails to the library and to handle mails not whitelisted by Mail-Shake. Therefore the agent has to implement the interfaces provided by Akonadi to be notified whenever a new mail is received or an existing one is changed or deleted. The Akonadi API provides the interface Akonadi::AgentBase::Observer, which has to be implemented to be notified about new mails. As the mail format of Akonadi does not match the one used by Mail-Shake, methods are required to translate from and to the Mail-Shake format.

An optional delay is required before dropping not whitelisted mails, so that the user has the chance to add an address to the whitelist after the mail has been received. Therefore the processing has to be delayed, which can be achieved with a timer. After the timeout a signal is emitted and the agent has to provide the matching slot for this signal. If for example the mail is deleted before the timeout, the timer can be stopped so that the mail will not be processed.

The required classes are illustrated in Figure 4.8, and the communication between the Akonadi server, the agent and the client independent library is illustrated in Figure 4.9, which shows the sequence diagrams for the receipt of a public mail and of a not whitelisted mail. In case of a whitelisted mail or a mail which is a response to a Mail-Shake challenge the interaction is slightly different. The sequence diagram shown in Figure 4.9(a) illustrates how the timer is created and how the processing of a private mail is delayed by this timer. If the mail is whitelisted the timer is not created and the processing is not delayed.

Figure 4.9: Communication between Akonadi server, agent and Mail-Shake library. (a) Receiving a not whitelisted mail; (b) Receiving a mail for the public mail collection (message sequence charts "Private Mail Received" and "Public Mail Received"; participants: Akonadi server, MailShake agent, Library instance and a timer; messages: itemAdded, testPrivateMailReceived, timeout, privateMailReceived, notifyMailShake, publicMailReceived, idAdded, mailShakeChallenge, sendMail)

Mail handling is only one of the important tasks of the agent. Another very important task is to provide the user interface to configure Mail-Shake. The user interface has to offer the following configuration options:

• Public address management

• Private address management

• Sent mails management

• Whitelist management.

At this point it does not make sense to discuss the high level class design of the configuration interface, as this is very specific to the implementation and the libraries which are used. Of course the agent also needs to provide the means to notify the user that a mail will be dropped and has to offer a way to access the configuration easily. This is very specific to the implementation and is therefore discussed in Chapter 4.3.2.5 on page 74 and omitted here in the discussion of the high-level software design.

Figure 4.10: Class diagram for Mail-Shake email client integration
CAPTCHA: emailAddress():string, setCaptchaUrl(url:string), showUserInterface():bool; implemented by reCAPTCHA and scr.im
Mail: isMailShakeMail():bool, captchaUrl():string

4.2.3 Client Integration

The third and last part of the Mail-Shake implementation is the integration into existing email clients. That is, the client should guide the user in solving the challenge. Therefore the client has to be made aware of the fact that a mail is a Mail-Shake challenge and has to provide a user interface to solve the challenge.

Of course such an integration is very specific to the email client. Nevertheless all clients have to offer the same functionality, so it is possible to define a common high-level design. The client specific implementation differs slightly from this design. As illustrated in Figure 4.10 the integration has to provide a mail representation. This class has to provide methods to extract the challenge URL. Based on that URL the client integration has to recognize the CAPTCHA provider and use the specific implementation for this provider. In the scope of this thesis the providers Mail-Hide and scr.im have been implemented, so those two have to be supported. In general the URL has to be passed to the provider, which has to download the CAPTCHA and display it in a user interface, so that the user does not need to visit a web site. Last but not least the provider has to extract the email address once the challenge is solved and pass it to the implementation so that the mail can be resent.
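Recognizing the CAPTCHA provider from the challenge URL can be sketched by comparing the host part of the URL. The host name mailhide.recaptcha.net for Mail-Hide is an assumption made for this illustration, as are the names Provider, hostOf and providerFor; scr.im is the service named in the text.

```cpp
#include <string>

enum class Provider { MailHide, ScrIm, Unknown };

// Extract the host part of a URL such as "http://scr.im/abcd".
inline std::string hostOf(const std::string& url) {
    std::string::size_type start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    const std::string::size_type end = url.find('/', start);
    return url.substr(start,
                      end == std::string::npos ? std::string::npos : end - start);
}

// Select the provider specific implementation based on the host.
inline Provider providerFor(const std::string& challengeUrl) {
    const std::string host = hostOf(challengeUrl);
    if (host == "mailhide.recaptcha.net") return Provider::MailHide;  // assumed host
    if (host == "scr.im") return Provider::ScrIm;
    return Provider::Unknown;
}
```

A challenge with an unknown provider could still be handled by showing the plain link to the user, so the sketch returns Unknown instead of failing.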

4.2.4 Summary

In this Section the high-level design of Mail-Shake has been discussed. It was shown that Mail-Shake consists of three parts:

1. Client independent library

2. Client specific part

3. Integration in existing email clients.

The client independent library is the actual implementation of the Mail-Shake concept without the ability to receive and send mails or to handle persistent storage. The task of the library is to decide whether a passed private mail has to be dropped or not. It is also responsible for generating unique identifiers for public mails and comparing these to the ones listed in the private mails. In case of a match it has to add an entry to the whitelist. The task of the client specific part is to receive and send mails and to interact with the Mail-Shake library. By separating these two tasks it is possible to reuse the code if an integration into a specific email client should be implemented. In this Section a specific implementation was proposed: the Akonadi Agent. Its task is to pass mails received from Akonadi to the library and to provide a configuration interface. Last but not least there has to be a direct integration into email clients for solving the challenge without visiting a web site; this circumvents the problem that opening hyperlinks in mails should be discouraged. The integration is client specific, so only a very abstract example could be provided.

4.3 Implementation of Mail-Shake

In this Section the implementation of Mail-Shake is discussed. It follows the separation into three parts discussed in Section 4.2. First of all the implementation of the client independent library is discussed. This includes the implementation of the complete Mail-Shake functionality as this is the task of the library. The second topic of this Section is the implementation of the Akonadi Agent. It is shown how the agent interacts with the library, how Akonadi is used and how mails are received and sent. In this scope the discussion of the user interface is another important issue. Mail-Shake has to provide an easy way to manage the whitelist and has to notify the user whenever a mail is dropped so that the user can change the whitelist in order to communicate with automated mail systems. Last but not least the integration of solving Mail-Shake challenges into an existing email client is discussed. As at the time of this writing the KDE email client KMail is being ported to Akonadi, it does not make sense to integrate into that client. Instead the integration is shown for the alternative KDE email client Mailody, which is already using Akonadi. It will be an easy task to reuse the code in the Akonadi port of KMail.

4.3.1 Mail-Shake Library

4.3.1.1 Binary Compatibility

The primary idea behind the client independent library is to offer a common implementation which can be reused in various email clients to provide a direct integration without relying on the presence of Akonadi, as it is currently only guaranteed in a KDE workspace. Although in the scope of this thesis only one application is written which uses the library, and this application would not require a dynamically linked library, the Mail-Shake library is implemented in a way that makes it easy to create a dynamically linked library instead of a statically linked one. Therefore the library has to provide binary compatibility, which means that a program linked dynamically to a former version of the library will continue to work with the new binary without the need to recompile and will produce the same results[57]. If the library does not provide binary compatibility each program using the Mail-Shake library has to be linked statically, which wastes resources, and applications do not benefit from bugfixes and extensions without shipping a new release[20].

In case of Mail-Shake an example for the need of binary compatibility are the optional notifications. If a new mail notification is added in a future release, an application using Mail-Shake should behave the same way as before the new notification was added. All existing functionality has to continue to work and the new optional notification is unused. An optional and unused extension must not result in the need to recompile or, worse, in runtime errors or a complete refusal to start the application with an error such as:

    symbol-lookup error: libkdeinit4_kwin.so: undefined symbol: _ZTI26KDecorationFactoryUnstable²

In order to provide binary compatibility the implementation of the Mail-Shake library follows the recommendations for binary compatibility with C++ provided by the KDE community. For example it is not possible to safely add new data members to classes as this changes the size of the class[20]. A possible way to circumvent this problem is the usage of the “d-pointer” technique³. The design class is split into two implementation classes. One class provides the interface while the other class contains the actual implementation. The interface class contains only one pointer to the implementation class as a data member and the actual data is completely embodied in the implementation class. Of course the additional indirection has a performance cost and inheritance is less useful[12]. In Figure 4.11 the change to the class structure of the EMail class as defined in the design Chapter in Figure 4.5 on page 48 is illustrated.

[Figure 4.11: Classes EMail and DSN split in interface and implementation classes. EMail (interface): addHeader(h:string, d:string), header(name:string):list, replyMail():EMail, isMailShakeChallenge():bool, isMailShakeChallengeReply():bool. EMailPrivate (implementation): fromAddress:string, replyTo:string, sender:string, toAddress:string, messageId:string, inReplyTo:string, challengeId:string, challengeURL:string, challengeResponseId:string, subject:string, headers:map. DSN (interface) with DSNPrivate (implementation): originalEmail:EMail]

Another issue concerning binary compatibility are virtual functions. In C++ a method which can be reimplemented in a derived class has to be declared as virtual[61]. It is not possible to add a virtual method to a non-leaf class (that is, a class with at least one derived class) as that changes the layout of the virtual table (vtable), which is used to dispatch virtual functions and access virtual base class subobjects. A vtable consists of a sequence of offsets, data pointers and function pointers[7]. By adding a new virtual function an entry has to be added to the vtable which is not present in the existing vtables for derived classes[39].

As illustrated in Figure 4.8 on page 50 the client side of Mail-Shake has to implement an interface provided by the Mail-Shake library. The C++ equivalent to an interface as known from e.g. Java is a pure virtual class. That is a class which consists only of pure virtual (abstract) functions. A class with pure virtual functions cannot be instantiated and therefore a derived class is required which implements these functions[61]. It is obvious that this class is a non-leaf class, so new virtual functions cannot be added without breaking binary compatibility. As the notifications sent from Mail-Shake are designed as methods in the interface, the example of adding a new notification provided above would break binary compatibility as well. This implies that the implementation must not use virtual methods if the Mail-Shake library is to be extended in possible future releases. Instead of invoking methods of a pure virtual class the “Signal and Slot” concept, a variant of the Observer pattern[23], can be used. The library has to emit signals and the client side implementation has to provide matching slots and has to connect these to the signals. The concept can be compared to simple callbacks. For the Mail-Shake library the boost signals library is used, which is discussed in more detail in Section 4.3.2.1.

These two measures, avoiding virtual methods and using a d-pointer, should help to develop the library in a way that makes it possible to guarantee binary compatibility if required. Of course this has some disadvantages: a client side implementation cannot reimplement methods of the library, and the library code becomes more complex due to the usage of a d-pointer.

²A binary incompatibility issue of the KDE window manager KWin in KDE SC 4.3
³This technique is also known as “pimpl”, as “handle/body” or as “Cheshire cat”
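The two measures of this Section can be illustrated with a minimal sketch. The class names Notifier and NotifierPrivate are hypothetical and not part of the Mail-Shake API, and std::function callbacks are used as a stand-in for the boost signals library so that the example needs only the standard library.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// --- would live in the public header ---------------------------------------
class NotifierPrivate;  // forward declaration only; the layout stays hidden

class Notifier {
public:
    typedef std::function<void(const std::string&)> Slot;

    Notifier();
    ~Notifier();
    Notifier(const Notifier&) = delete;
    Notifier& operator=(const Notifier&) = delete;

    // Registers a callback ("slot") for the mail-dropped notification.
    void connectMailDropped(Slot slot);
    // Invokes all registered callbacks ("emits the signal").
    void dropMail(const std::string& sender);

private:
    NotifierPrivate* const d;  // the only data member of the interface class
};

// --- would live in the .cpp file --------------------------------------------
// New members can be added here without changing sizeof(Notifier).
class NotifierPrivate {
public:
    std::vector<Notifier::Slot> mailDroppedSlots;
};

Notifier::Notifier() : d(new NotifierPrivate) {}
Notifier::~Notifier() { delete d; }

void Notifier::connectMailDropped(Slot slot) {
    d->mailDroppedSlots.push_back(std::move(slot));
}

void Notifier::dropMail(const std::string& sender) {
    for (const Slot& slot : d->mailDroppedSlots) {
        slot(sender);
    }
}
```

A client that never connects a slot is unaffected when a further notification is added later, because such an addition only introduces new non-virtual functions and new members of the private class.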

4.3.1.2 Managing the Whitelist

One of the most important topics of the Mail-Shake library is managing the whitelist. Each time a private mail is received the whitelist has to be searched for a match to decide if the mail has to be dropped. Likewise, if a sender successfully authorized himself by solving the Mail-Shake challenge, a new entry has to be added to the whitelist. As the whitelist has to be managed by the user as well, an entry offers different matching techniques:

• Case sensitive matching

• Case insensitive matching

• Case sensitive part matching

• Case insensitive part matching

• Regular expression matching.

Especially the regular expression matching has some implications for the implementation. Whether the expression matches a mail header is only known after the match is applied. That is, a mail has to be matched against each regular expression. Therefore the complete whitelist has to be in memory, and using a relational database does not provide advantages as the complete table needs to be checked. Only for case sensitive matching of addresses might searching a database be useful. But for that an index has to be created on the column holding the address. An index allows direct access to the data referenced by a search criterion. If we have a look at the whitelist entry as illustrated in Figure 4.6 on page 49, the filter is the dominant data and an index on that column is more or less equivalent to the complete table, so using an index on the named column is equivalent to keeping the complete table in memory. This is rather obvious for the case that the index is stored as a binary tree or a variant of one, as these data structures are designed to be kept in main memory[32].

So using a relational database management system to search for matches in the whitelist is not an option. The whitelist has to be kept in memory and querying the whitelist has to be implemented in the Mail-Shake library. Of course keeping the whitelist in memory may be bad if the whitelist has many entries. But in case of Mail-Shake this is not to be expected. It is unlikely that the whitelist has more entries than a normal address book, which is kept in memory as well. In fact the whitelist is most likely smaller than an address book as it only has an entry for the email address, and organizational addresses can be matched with a part matching rule such as “@uni-heidelberg.de” to match all addresses from the University of Heidelberg.

e whitelist consists of two different types of entries:

1. Automatically generated entries by Mail-Shake and

2. Entries created manually by the user.

In the first case all entries match addresses and not generic headers. A case insensitive match on the complete string is performed. The temporary whitelist in fact consists only of entries of that kind. The entries created by the user might match that format or match a special header with, for example, a regular expression.

As the entry supports different matching strategies the class WhiteListEntry provides a method to compare a given string with the filter of the entry in a general way. As illustrated in Listing 4.1 the method just returns true if the used filter matches the given string for the matching strategy of that entry. Of course it is possible to use a specific class for each match type, but for that a virtual method is required. As the code is still easy to read and understand, this is a good compromise compared to a perfect object-oriented design. In fact the perfect object-oriented design requires quite some boilerplate code.

    bool WhiteListEntry::matches(const std::string& data)
    {
        switch (d->match) {
        case ExactMatch:
            return (data.compare(d->filter) == 0);
        case CaseInsensitiveMatch: {
            std::string lowerData = boost::to_lower_copy(data);
            std::string lowerFilter = boost::to_lower_copy(d->filter);
            return (lowerData.compare(lowerFilter) == 0);
        }
        case PartMatch:
            return (boost::find_first(data, d->filter));
        case CaseInsensitivePartMatch: {
            std::string lowerData = boost::to_lower_copy(data);
            std::string lowerFilter = boost::to_lower_copy(d->filter);
            return boost::find_first(lowerData, lowerFilter);
        }
        case RegExMatch: {
            boost::regex e(d->filter);
            return boost::regex_match(data, e);
        }
        }
        return false;
    }

Listing 4.1: Matching a string against a whitelist entry

The Mail-Shake library passes the whitelist to and from the client side implementation as a linked list as illustrated in the class diagram in Figure 4.7 on page 49 (see also the API documentation in Appendix C). So it would make sense to use the whitelist in the library as a linked list as well. But this has some disadvantages. Given the structure of the whitelist entry it is not possible to sort the list. It is the same problem as already discussed for the usage of a relational database. In case of the temporary whitelist a test if an address is on the whitelist is performed in O(n) as each element of the list has to be checked. On the other hand, in the case of the temporary whitelist the entries can be sorted. Each entry is a case insensitive match on an address. So the list could be sorted by storing all filters as lower case text. In a sorted list an entry can be found by binary search, reducing the algorithm complexity to O(log(n)). As the sorting is only useful in this case, the class WhiteListEntry cannot provide an implementation of operator<(), which is required to sort a list. So the temporary whitelist cannot be stored in a linked list, but is stored in a map, which provides item lookup, insertion and deletion in O(log(n))[61], using the lower case address as the key and the entry as the value. This comes with an overhead in the interaction with the client side implementation as a list has to be constructed or the list has to be converted into a map. As the temporary whitelist is only passed to the library at startup and when it is cleaned up and is considerably small, this overhead can be neglected. For the temporary whitelist this improvement is rather small in practice. An entry is only added to the temporary whitelist when the user sends a mail and the recipient’s address is not listed on the permanent or temporary whitelist.
Furthermore the entries on the temporary whitelist can be deleted by the client side implementation after a specified period of time. Thus we can assume that the temporary whitelist is rather small.

    bool MailShakePrivate::privateMailOnWhitelist(const EMail &mail)
    {
        // test if there is an address match
        if (entry(whitelist, mail.fromAddress())) {
            return true;
        }
        typedef std::pair<std::string, std::list<std::string>*> headerPair;
        BOOST_FOREACH (headerPair p, mail.headers()) {
            // headerPair consists of the name of the header (first)
            // and a pointer to a list of strings (second) for the header
            // we have to iterate on the list of strings
            BOOST_FOREACH (std::string value, *p.second) {
                // if one of the values returns an entry in the whitelist we have a match
                if (entry(whitelist, value, p.first)) {
                    return true;
                }
            }
        }
        // not on whitelist
        return false;
    }

Listing 4.2: Trivial algorithm to check if a mail is whitelisted

For the permanent whitelist such an optimization is more important. In Listing 4.2 the trivial approach for finding a matching whitelist entry is illustrated. First it is checked if the address matches one of the address filtering whitelist entries by calling the method entry(), which is illustrated in Listing 4.3. This method just loops through all whitelist entries to find a matching entry. It is obvious that the case of address filtering is linear in the number of entries in the whitelist. After checking the address filtering each mail header is checked against each whitelist entry by calling the method entry() again. As there is no matching whitelist entry for most of the headers it is quite a waste of resources to use an algorithm with a complexity of O(n · m) ≈ O(n²), ignoring the added complexity of a string comparison.

The performance can be improved for the address filtering in the same way as for the temporary whitelist, resulting in a check with complexity O(log(n)). But also for the headers the algorithm can be improved by using a smarter data structure. Instead of using one linked list a map can be used as well. The lower case name of the mail header is used as a key and all whitelist entries for this header are saved in a linked list as the value of the map. This improved algorithm is shown in Listing 4.4. Each mail header is only matched against the entries for this header and not compared to other headers any more. Finding the list of possible matching entries is performed with a complexity of O(log(n)). As each header has to be tested the complexity (again ignoring string and/or regular expression comparison) is O(n · log(n)), which is a significant improvement compared to the trivial algorithm. Of course this comes again with an overhead for saving and loading the whitelist, which is only required on startup and when the user manually changes the whitelist.

    WhiteListEntry *MailShakePrivate::entry(const std::list< WhiteListEntry* > &list,
                                            const std::string &data,
                                            const std::string &header) const
    {
        WhiteListEntry *ret = NULL;
        std::string headerLower = boost::to_lower_copy(header);

        BOOST_FOREACH(WhiteListEntry *entry, list) {
            if (!header.empty() && !entry->isAddressFiltering()) {
                if (boost::to_lower_copy(entry->header()) != headerLower) {
                    continue;
                }
                if (entry->matches(data)) {
                    ret = entry;
                    break;
                }
            }
            if (header.empty() && entry->matches(data)) {
                ret = entry;
                break;
            }
        }

        return ret;
    }

Listing 4.3: Comparing the whitelist entries to a given datum

    typedef std::pair<std::string, std::list<std::string>*> headerPair;
    BOOST_FOREACH (headerPair p, mail.headers()) {
        const std::string header = boost::to_lower_copy(p.first);
        std::map<std::string, std::list<WhiteListEntry*> >::const_iterator it = whitelist.find(header);
        if (it == whitelist.end()) {
            // no whitelist entries for this header, so no need to check
            continue;
        }
        BOOST_FOREACH (std::string value, *p.second) {
            if (entry(it->second, value, header)) {
                return true;
            }
        }
    }

Listing 4.4: Improved algorithm to test if a mail is whitelisted based on a smarter data structure⁴

⁴Identical comments to the one shown in Listing 4.2 are removed

4.3.1.3 Managing the Storage of Unique Identifiers

Next to the whitelist Mail-Shake needs another data structure for the unique identifiers generated whenever a mail is received for the public address. Most of the points discussed for the whitelist hold for the Id storage as well. Each time a unique identifier is generated it has to be verified that the identifier is not yet used. Therefore the current storage has to be searched for the generated identifier. If the storage is not sorted, because each new identifier is added to the end of the list, the lookup is performed in O(n). Thus inserting a new item needs linear time although appending an item to a linked list can be performed in constant time. Therefore the storage has to be sorted, so that inserting a new item is performed in O(log(n)). Instead of using a linked list a map is used with the string representation of the identifier as the key and the MailShake::Id as value. The storage is passed to and from the client side implementation as this map, so there is no overhead as seen for the whitelist, which is passed as a linked list. Given the fact that all Ids are numbers, a possible improvement is the usage of integers as the key in order to eliminate the string comparison. On the other hand this removes the possibility to generate unique Ids which cannot be represented by a numerical value.

    void MailShake::publicMailReceived(const EMail& mail)
    {
        // drop delivery status notifications
        if (dynamic_cast<const DSN*>(&mail)) {
            return;
        }
        if (mail.isMailShakeChallenge()) {
            return;
        }

        EMail reply = mail.replyMail();
        if (reply.toAddress().empty()) {
            // there are no recipients - drop it
            return;
        }
        // generate unique id and store in storage
        std::string uid = d->nextId();
        reply.setChallengeId(uid);
        Id *id = new Id(uid, mail.fromAddress(),
                        boost::posix_time::second_clock::universal_time());
        d->idStorage[uid] = id;
        // emit signal
        d->idAdded(id);
        d->challenge(reply);
    }

Listing 4.5: Handling the receipt of a mail sent to the public address

A new identifier is generated when a mail is sent to the public address. The processing is stopped in case of a delivery status notification or a Mail-Shake challenge. The code for processing a mail received for the public address is shown in Listing 4.5. The generated Id is used for the Mail-Shake challenge mail and saved in the local storage. By invoking the signals, the client side implementation is informed about the new identifier, so that it is saved durably, and the challenge mail is passed on for sending. The actual identifier is generated in the private method nextId(), which returns a string. This rather trivial method is illustrated in Listing 4.6. Each possible identifier is returned by generator(), a pseudo-random number generator provided by the boost libraries. It uses the mt19937 number generator, which is fast and has acceptable quality for a uniform distribution. The number generator is a specialization of the Mersenne Twister pseudo-random number generator[41]. For the uniform distribution the numbers between 10,000 and 99,999 are used, that is all numbers with five digits.

    std::string MailShakePrivate::nextId()
    {
        std::string uid;
        do {
            uid = boost::lexical_cast<std::string>(generator());
        } while (idStorage.find(uid) != idStorage.end());
        return uid;
    }

Listing 4.6: Generating a new unique identifier

As it is unlikely but possible that the random number generator returns an already used number, the generated numbers have to be probed until an unused one is found. Of course there is the chance that Mail-Shake is stuck in an endless loop if many or all possible identifiers are used. But there are 90,000 possible identifiers and an identifier is only used for a few days. Receiving 90,000 mails on one day can be considered a denial of service attack on the MTA and is probably blocked by the ISP⁵ or exceeds the disk quota of the mail storage.

    std::string EMail::challengeResponseId() const
    {
        std::string uid = d->challengeResponseId;
        if (uid.empty()) {
            // there is no header, we have to check the subject
            std::string subject = boost::to_lower_copy(d->subject);
            boost::regex e("((.)*(mail-?shake)(\\s|-)(id:?)(\\s)*(\\d{5})(.)*)");
            boost::match_results<std::string::const_iterator> matches;
            if (boost::regex_search(subject, matches, e)) {
                uid = matches[7];
            }
        }
        return uid;
    }

Listing 4.7: Extracting the Mail-Shake Id from a mail subject

The identifier is needed when a private mail is received to decide if a sender’s address can be added to the permanent whitelist. Therefore both the identifier, set in the subject or in the special X-Mailshake-Response-ID header, and the address have to match one entry in the storage. Extracting the identifier from the mail is trivial if the header is set. Unfortunately the header can only be set if there is a Mail-Shake integration in the email clients, which in general does not yet exist. Therefore the subject has to be parsed with a regular expression as illustrated in Listing 4.7. The regular expression is rather generic and does not force the sender to type the Id in one exact way. In fact it can be added to a normal subject, which is probably the best way as the sender can just resend the original mail sent to the public address with the Id appended to the original subject. The only requirement is that the subject contains “mail-shake-id:” and five digits for the Id. Of course this is case insensitive and the colon, hyphen and white spaces are partially optional. Some examples of valid and invalid subjects are presented in Table 4.1.

    Possible Subject                              Identifier can be extracted?
    A Subject                                     No, Mail-Shake Id is missing
    Mailshake Id: 12345                           ✔
    Mailshake Id: 1234                            No, Id is too short
    Mail-Shake-Id123456                           ✔
    A subject(MAILSHAKE ID 12345)                 ✔
    Mail Shake Id 12345                           No, white space in “Mail-Shake” is not allowed
    mailshake-id 12345 (original subject)         ✔
    12345                                         No, variant of “Mail-Shake id” is missing
    A subject (Mail-Shake-Id: 12345) more text    ✔

Table 4.1: Examples for subjects containing a Mail-Shake id

If the received mail contains an identifier which is not present in the storage, the client side implementation is invoked in order to send an optional notification. Likewise, if the identifier is present in the storage but the address does not match, the client side implementation is notified. If both match, the Id can be deleted and an entry can be added to the whitelist. Of course the entry need not be added if one is already present, and in case there is one on the temporary whitelist it just has to be moved to the permanent whitelist.

⁵Internet Service Provider
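The two steps of this Section, drawing an unused five-digit identifier and recovering it from a subject line, can be sketched with the standard library alone. This is an illustration under assumptions, not the thesis code: the storage is reduced to a map from identifier to address, the regular expression is a simplified variant of the one in Listing 4.7 (it reports the first occurrence in the subject), and the retry limit maxAttempts is a hypothetical addition that bounds the loop discussed above.

```cpp
#include <cassert>
#include <map>
#include <random>
#include <regex>
#include <string>

// Simplified storage: identifier -> sender address (assumption for this sketch).
typedef std::map<std::string, std::string> IdStorage;

// Draws five-digit identifiers until one is found that is not a key of
// `storage`. Returns an empty string after `maxAttempts` unsuccessful draws
// instead of looping forever (hypothetical safeguard, not in Listing 4.6).
std::string nextId(const IdStorage& storage, std::mt19937& generator,
                   int maxAttempts = 1000) {
    std::uniform_int_distribution<int> fiveDigits(10000, 99999);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        const std::string uid = std::to_string(fiveDigits(generator));
        if (storage.find(uid) == storage.end()) {
            return uid;
        }
    }
    return std::string();
}

// Extracts the five-digit Id following a variant of "mail-shake-id:" from a
// subject, matching case-insensitively. Returns an empty string if none is found.
std::string extractId(const std::string& subject) {
    static const std::regex pattern("mail-?shake(?:\\s|-)id:?\\s*(\\d{5})",
                                    std::regex::icase);
    std::smatch match;
    if (std::regex_search(subject, match, pattern)) {
        return match[1].str();
    }
    return std::string();
}
```

With this pattern the subjects of Table 4.1 give the same verdicts as listed there; for “Mail-Shake-Id123456” the first five digits are returned.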

4.3.1.4 Handling of Received and Sent Mails

During the discussion of the management of the unique identifiers above, the receipt of public mails has already been mentioned. Receiving private mails was addressed as well as it is required for removing mails. Nevertheless most of the algorithm has not been discussed and is presented here. The algorithm for handling the receipt of private mails decides if a mail is dropped or presented to the user. If it is faulty, Mail-Shake produces either false positives or false negatives, and both are unacceptable. As the client side implementation might offer a processing delay, so that the user can manually add a sender to the whitelist, a method for testing if the mail would be dropped is required. This method has to provide the same functionality as the “real” method without changing the internal data structures, that is for example not adding an address to the whitelist or sending out notification mails. The requirement that the internal data structures are kept unchanged can be verified at compile time by declaring the method to be const. This means that the method cannot modify the member variables or call non-const member functions[5]. The other requirement, that the test method has to work in the same way as the final method, can be verified with unit tests. In each test where the final method is called, first the test method has to be called and the result has to be identical. Of course this does not guarantee that the methods are identical as that cannot be guaranteed by unit tests. Nevertheless everything that is tested is guaranteed to produce the same result. As the source code of testPrivateMailReceived(const EMail& mail) is easier to understand than privateMailReceived(const EMail& mail), because it does not have to add or remove entries from whitelists or remove identifiers, the discussion of the functionality concentrates on the test method.
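One point worth noting when a const method is combined with a d-pointer: in a const member function the pointer itself is const, but the object it points to is not, so a write through the raw d-pointer would still compile. The following sketch shows one way to keep the compile time check by reading the private data only through a const view. The class Filter, its members and the accessor name d_func() are hypothetical; unlike the thesis, where both methods are implemented separately and compared by unit tests, the modifying method here simply reuses the test method.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

class FilterPrivate {
public:
    std::set<std::string> whitelist;         // whitelisted sender addresses
    std::map<std::string, std::string> ids;  // challenge id -> sender address
};

class Filter {
public:
    Filter() : d(new FilterPrivate) {}
    ~Filter() { delete d; }
    Filter(const Filter&) = delete;
    Filter& operator=(const Filter&) = delete;

    void addChallenge(const std::string& id, const std::string& sender) {
        d->ids[id] = sender;
    }

    // Test variant: decides whether the mail would be accepted.
    // All reads go through a pointer-to-const, so an accidental write such
    // as p->whitelist.insert(sender) is rejected by the compiler.
    bool testPrivateMailReceived(const std::string& sender,
                                 const std::string& id) const {
        const FilterPrivate* p = d_func();
        if (p->whitelist.count(sender) > 0) {
            return true;
        }
        const auto it = p->ids.find(id);
        return it != p->ids.end() && it->second == sender;
    }

    // Final variant: same decision, then updates the data structures.
    bool privateMailReceived(const std::string& sender, const std::string& id) {
        if (!testPrivateMailReceived(sender, id)) {
            return false;
        }
        d->ids.erase(id);
        d->whitelist.insert(sender);
        return true;
    }

private:
    const FilterPrivate* d_func() const { return d; }
    FilterPrivate* const d;
};
```

Calling the test variant any number of times leaves the storage untouched, which is the property the processing delay relies on.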
Both methods are in fact a straightforward implementation of the requirements discussed in Section 4.1.5 and illustrated in Figure 4.4.

    // email is on whitelist?
    if (d->privateMailOnWhitelist(mail)) {
        return true;
    }

    // check for Delivery Status Notifications
    if (const DSN* dsn = dynamic_cast<const DSN*>(&mail)) {
        // a reply to an email sent from this method will have an
        // X-Mailshake-URL header, those DSN should be dropped
        // all other DSN should be given to the user
        if (!dsn->originalEMail().challengeURL().empty()) {
            return false;
        } else {
            return true;
        }
    }

Listing 4.8: Checking if received private mail is whitelisted or a DSN

First of all the method has to check if the mail is on the permanent whitelist using the algorithm described in Chapter 4.3.1.2 (compare Listing 4.2 and 4.4). In that case the mail can be accepted and the processing can be stopped by returning true as illustrated in Listing 4.8. The next step is to test if the mail is a delivery status notification. In Mail-Shake a DSN is a class derived from EMail, so by using a dynamic cast it is possible to test if the object is an instance of DSN. If the object is not an instance of DSN the cast returns a null pointer and the if statement is false. Otherwise the pointer is set and can be used to test if the DSN is in response to a Mail-Shake notification. In that case the DSN can be dropped. If the attached original mail is not in response to a Mail-Shake notification or there is no attached mail at all, the DSN cannot be dropped as it might be in response to a mail sent by the user. In both cases there is no need to further process the mail.

    const std::string uid = mail.challengeResponseId();
    if (!uid.empty()) {
        if (d->idStorage.find(uid) == d->idStorage.end()) {
            // storage does not contain the id, obviously the id is wrong
            return d->isOnTemporaryWhitelist(mail.fromAddress());
        }
        Id* id = d->idStorage[uid];
        if (boost::to_lower_copy(id->address()) != boost::to_lower_copy(mail.fromAddress())) {
            // address is not correct, but Id is correct
            return d->isOnTemporaryWhitelist(mail.fromAddress());
        } else {
            return true;
        }
    } else {
        // no MailShake Challenge reply email
        // check for temporary whitelist
        return d->isOnTemporaryWhitelist(mail.fromAddress());
    }

Listing 4.9: Checking if mail contains a challenge response Id or is on temporary whitelist

The next processing step as shown in Listing 4.9 is to check if the mail is a reply to a Mail-Shake challenge. That is, the mail must contain the Id either in the special header or in the subject. If such an Id is present the storage has to be searched for it. Of course both Id and sender address have to match. If there is no Id present or either Id or address is incorrect, the final processing step is to test if the address is listed on the temporary whitelist.

The method privateMailReceived(const EMail& mail) has to modify the data structures accordingly. That is, in case there is a matching identifier, the identifier has to be removed from the storage and an entry has to be created for the whitelist. The code for creating this entry is illustrated in Listing 4.10. The shown method also handles the case that there is already an entry on the temporary whitelist and moves this to the permanent whitelist, but only in the case that there is not yet an entry for this specific address on the permanent whitelist. In fact this is more or less just a safety check: as shown above, the algorithm returns early if there is an entry on the permanent whitelist, so the check for a temporary whitelist entry or the Id is not performed.

Last but not least the implementation of the requirement for the handling of sent mails as illustrated in Figure 4.3 on page 45 is discussed. From the diagram itself it is obvious that this part is rather small. The implementation has to extract all addresses from the mail header and insert an entry on the temporary or permanent whitelist. This guarantees that a response to a sent mail is not dropped by Mail-Shake. Therefore it is important that each address is added to the whitelist. A problem in this regard are the “Blind Carbon Copies” (Bcc), which are used for recipients of the mail without revealing the address to other recipients of the same message.
void MailShakePrivate::moveOrAddEntryToPermanent(const std::string& address) { WhiteListEntry *entry = removeTemporaryWhiteListEntry(address); // only add if it whitelist does not contain an entry matching the address if (!this->entry(whitelist[""], address)) { if (!entry) { entry = new WhiteListEntry(); entry->enableAddressFiltering(true); entry->setFilter(address); entry->setMatchType(CaseInsensitiveMatch); entry->setDate(boost::posix_time::second_clock::universal_time()); } whitelist[""].push_back(entry); whitelistAdded(entry); } else { delete entry; entry = NULL; } }

Listing 4.10: Move an entry from temporary to permanent whitelist or create a new one.
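The order of the checks described for private mails can be summarised in a small sketch. It is not the code of the Mail-Shake library: the types Mail and Store and the function acceptPrivateMail are hypothetical, only the standard library is used, and addresses are assumed to be lower-cased already. That a hit on the temporary whitelist promotes the entry to the permanent whitelist follows the description of Listing 4.10.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-ins for the library's data structures.
struct Mail {
    std::string sender;       // sender address, assumed to be lower-cased
    std::string challengeId;  // Id from the special header or the subject, may be empty
};

struct Store {
    std::set<std::string> permanent;         // permanent whitelist
    std::set<std::string> temporary;         // temporary whitelist
    std::map<std::string, std::string> ids;  // issued Id -> address it was issued to
};

// Returns true if the mail is accepted and false if it has to be dropped.
bool acceptPrivateMail(Store &store, const Mail &mail) {
    // 1. Permanent whitelist: accept and return early.
    if (store.permanent.count(mail.sender)) {
        return true;
    }
    // 2. Reply to a challenge: both Id and sender address have to match.
    std::map<std::string, std::string>::iterator it = store.ids.find(mail.challengeId);
    if (it != store.ids.end() && it->second == mail.sender) {
        store.ids.erase(it);                 // the Id is used up
        store.temporary.erase(mail.sender);  // drop a temporary entry, if any
        store.permanent.insert(mail.sender);
        return true;
    }
    // 3. Temporary whitelist: accept and move the entry to the permanent whitelist.
    if (store.temporary.erase(mail.sender) > 0) {
        store.permanent.insert(mail.sender);
        return true;
    }
    return false;
}
```

A mail with a known Id but a different sender address falls through to the third check and is dropped, and the Id stays in the storage for the legitimate sender.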

    BOOST_FOREACH(const std::string &address, recipients) {
        if (d->entry(d->whitelist[""], boost::to_lower_copy(address))) {
            continue;
        }
        if (d->isOnTemporaryWhitelist(address)) {
            continue;
        }
        // mail is neither on temporary nor on permanent whitelist
        WhiteListEntry *entry = new WhiteListEntry();
        entry->enableAddressFiltering(true);
        entry->setFilter(address);
        entry->setMatchType(CaseInsensitiveMatch);
        entry->setDate(boost::posix_time::second_clock::universal_time());
        if (mail.isMailShakeChallengeReply()) {
            // we can assume that the email address is a valid email
            d->whitelist[""].push_back(entry);
            d->whitelistAdded(entry);
        } else {
            d->addTemporaryWhitelistEntry(entry);
            d->temporaryWhitelistAdded(entry);
        }
    }

Listing 4.11: Adding a whitelist entry for each recipient of a sent mail

Implementations for sending mails in general remove the Bcc header, at least for the recipients specified in the “To” and “CC” headers or for all recipients, and send a modified copy to each recipient[53]. As Mail-Shake is only invoked after mails were sent by an email client, it is not known if there are Bcc recipients, and the header might have been removed. If it is removed, Mail-Shake produces false results. Therefore it is the responsibility of the client side implementation to ensure that the Bcc header is included. Unfortunately the implemented Akonadi Agent does not send mails itself and has to rely on the mails produced by other email clients. If a sent mail is stored without the Bcc header, there is no way for the Akonadi Agent to pass the mail with Bcc headers set to Mail-Shake.

For each recipient Mail-Shake has to add an entry to the temporary whitelist if there is not yet an entry on the temporary or permanent whitelist. This is illustrated in Listing 4.11. The entry which is constructed and inserted into the whitelist is a special entry for the temporary whitelist. In case the sent mail is a reply to a Mail-Shake challenge, the entry can safely be added to the permanent whitelist, as the recipient address is definitely a private address. A mail which is not a reply to a Mail-Shake challenge might be a mail sent to a public Mail-Shake address, and its recipient may therefore only be added to the temporary whitelist.
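The rule for sent mails can be restated without the library's own types. The following is a minimal sketch under stated assumptions: Whitelists, toLower and sentMail are hypothetical names, the whitelists are reduced to sets of lower-cased addresses, and the recipient list is assumed to contain the Bcc addresses only if the client kept that header.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical reduction of the two whitelists to sets of lower-cased addresses.
struct Whitelists {
    std::set<std::string> permanent;
    std::set<std::string> temporary;
};

// Lower-cases an address so that lookups are case insensitive.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// recipients: all addresses from To, Cc and, if still present, Bcc.
void sentMail(Whitelists &lists, const std::vector<std::string> &recipients,
              bool isChallengeReply) {
    for (const std::string &recipient : recipients) {
        const std::string address = toLower(recipient);
        if (lists.permanent.count(address) || lists.temporary.count(address)) {
            continue;  // already whitelisted, nothing to do
        }
        if (isChallengeReply) {
            lists.permanent.insert(address);  // definitely a private address
        } else {
            lists.temporary.insert(address);  // might be a public Mail-Shake address
        }
    }
}
```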

4.3.1.5 Multi Threading

Multi threading or concurrency can help to improve the performance of applications. A thread is a lightweight process with its own stack but using the address space of the parent process[63]. Using multiple threads in Mail-Shake would allow several mails to be processed in parallel. As the discussion so far showed, the implemented library does not make use of multiple threads. This Section discusses why Mail-Shake does not use threads, and shows that it is implemented for a single threaded environment and is not thread safe.

Mail-Shake is designed to be run on a client machine and not in conjunction with an MTA. Multi threading is only useful if data processing is run in parallel, that is, in the case of Mail-Shake, if several mails are processed at the same time. This is rather unlikely. Even if several mails are fetched at the same time, they are processed serially instead of in parallel. At least both common protocols, IMAP and POP3, use one connection to the server to retrieve new mails[14, 46], so the mails are passed to Mail-Shake serially. Nevertheless it is possible, but unlikely, that one mail is still being processed when the next becomes available.

The Mail-Shake library is implemented with Akonadi in mind. Akonadi uses completely asynchronous connections, and newly arrived mails are fetched by the agents using an IMAP like protocol[71]. Of course it is possible to create one thread for each received mail in the Akonadi agent. But that creates a significant overhead for the cases in which only one mail has to be processed. Furthermore it is questionable why the processing of received mails should be done in parallel. Multi threading provides advantages if long lasting processing is done in the background, so that the GUI thread remains responsive[5]. Mail-Shake is a pure background process. There is no need to keep the user interface responsive as there is no user interface. There are no advantages for the user from this point of view.

In short, there is no benefit in using multi threading in Mail-Shake. Most of the time only one mail is processed, so the need for multi threading is simply not given. Although the use of multi threading does not make sense, it is interesting to know if the Mail-Shake library can be run in a multi threaded environment. There are of course trivial cases where Mail-Shake is thread safe. Thread safe means that a function can be called concurrently from different threads on the same data and the result is always defined[5]. For example, if two mails for the private address are processed and both sender addresses are on the permanent whitelist, Mail-Shake produces the correct result.

But it is also easy to construct a case where the Mail-Shake implementation is not thread safe. For this we look again at Listing 4.10 on page 66. The illustrated code first removes an entry from the temporary whitelist and then adds it to the permanent whitelist. Consider the case that an address is on the temporary whitelist and Mail-Shake is invoked in parallel on two private mails from this address. Thread A enters the method moveOrAddEntryToPermanent() and executes the first statement, which removes the whitelist entry from the temporary whitelist. At this point the scheduler starts Thread B, which enters the method for processing private mails. The entry has been removed from the temporary whitelist and has not yet been added to the permanent whitelist. So the mail is dropped although its sender is whitelisted. This shows that at least the method privateMailReceived is not thread safe.

By adding a mutex to this method, it could be made thread safe. But the used data structures do not seem to be thread safe either. At least some implementations of the Standard Template Library (STL) are not thread safe[13], so we have to assume that no implementation of the STL is thread safe. Furthermore the boost signal and slot library is not thread safe. However, a new thread safe implementation, which was released in May 2009, is available[27].

The fact that the STL containers are not thread safe helps to construct more cases where Mail-Shake is not thread safe. If one thread is in publicMailReceived and another thread is in privateMailReceived and both change the Id storage, a lost update can occur. For example, if one thread removes an entry while the other thread adds an entry, the added entry might be missing due to a race condition. This results in false positives if the receiver of the Mail-Shake challenge wants to send a mail to the private address.

This means that each method which changes the data structures has to be protected by the same mutex. Considering that each library entry point method might change the data structures, the complete object has to be protected by the mutex. So at most one thread could be running in the object, and the complete advantage of multi threading is lost. From a ’s point of view it is easier to say that the Mail-Shake library is not thread safe than to include a mutex to provide thread safety for the unlikely case of parallel processing of mails.

Source Directory    Number of Source Lines of Code
lib                  848
lib/private          328
tests                973
agent               1027
agent/config        1653
agent/config         747 (XML user interface)
Sum                 4829 (C++)

Table 4.2: Size of Mail-Shake measured in Source Lines of Code⁶

4.3.2 Mail-Shake Akonadi Agent

In this Section the Akonadi agent is discussed. The agent is the tie between Akonadi and the Mail-Shake library. Its task is to provide the client side implementation and to pass mails to and from Akonadi. Although the agent is more than twice the size of the library, only a small part is discussed, as most of the code is written for the configuration user interface and is therefore very specific to the Qt libraries. An overview of the size in source lines of code of the various parts of Mail-Shake is provided in Table 4.2.

4.3.2.1 Interaction with Mail-Shake Library

As mentioned above, Mail-Shake uses the boost Signals library for the interaction between library and client side. The library part of invoking the signals was already shown, for example in Listing 4.11 on page 66. The client side has to provide one matching method as a slot for each signal it is interested in. There are signals a client side might not be interested in. For example, if the user does not want to send notifications when a mail received for the private address is dropped, there is no need to provide a slot for that signal. As the agent is also intended as the reference implementation for the Mail-Shake library, it provides slots for each signal.

    void MailShakeAgent::init()
    {
        // [...]
        m_mailShake.signalIdRemoved(boost::bind(&MailShakeAgent::slotIdRemoved, this, _1));
        // [...]
    }

Listing 4.12: Connecting a slot to the signal with the boost library

⁶Measured with David A. Wheeler’s “SLOCCount”

Listing 4.12 illustrates how one of the slots is connected to the signal. The slot is, as mentioned, a normal public method. In this example the signal for removing one unique identifier from the storage is connected. The matching method, which just deletes the Id from the used backend database, is illustrated in Listing 4.13. The other signals are connected in a similar fashion. Interaction from the agent to the library is even less complex, as it is just simple method invocation.

    void MailShakeAgent::slotIdRemoved(const Id *id)
    {
        QSqlQuery query;
        const int uid = QString::fromStdString(id->uniqueId()).toInt();
        query.prepare("DELETE FROM storage WHERE id=:id");
        query.bindValue(":id", uid);
        query.exec();
    }

Listing 4.13: Slot for removing one Id from the storage

The agent uses two different Signals and Slots techniques: the one provided by the boost libraries and the one used by Qt. The first one uses C++ templates, while the latter uses macros and a special code generator, the Meta Object Compiler. Both techniques have their advantages and disadvantages[11]. Looking just at the syntax, the Qt Signals and Slots concept is preferable, as for example signals and slots are declared as such in the header file and thereby become part of the API documentation. Furthermore, thanks to the usage of macros, connecting signals and slots is more straightforward. An example is provided in Listing 4.14. Compared to the template based connection, the code states clearly that a connection is established and which signal is connected to which slot. This increases the code readability significantly.

    connect(timer, SIGNAL(timeout()), this, SLOT(slotCleanup()));

Listing 4.14: Connecting Signals and Slots with Qt
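The idea common to both techniques, a signal that keeps a list of callables and invokes them when it is emitted, can be illustrated with the standard library alone. The following sketch is a deliberately minimal stand-in and neither the boost nor the Qt mechanism: the class Signal, its methods connect and emitSignal, and the struct Client are hypothetical names introduced for illustration.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a signal with one argument. It lacks disconnecting,
// return values and thread safety, all of which real libraries provide.
template <typename Arg>
class Signal {
public:
    // Registers a slot, i.e. any callable taking the signal's argument.
    void connect(std::function<void(const Arg &)> slot) {
        m_slots.push_back(std::move(slot));
    }

    // Invokes all connected slots in the order in which they were connected.
    void emitSignal(const Arg &value) const {
        for (const auto &slot : m_slots) {
            slot(value);
        }
    }

private:
    std::vector<std::function<void(const Arg &)>> m_slots;
};

// Hypothetical client which records removed Ids instead of deleting them
// from a database.
struct Client {
    std::vector<std::string> removed;
    void slotIdRemoved(const std::string &id) { removed.push_back(id); }
};
```

A client would connect its method with a lambda, for example idRemoved.connect([&client](const std::string &id) { client.slotIdRemoved(id); }), which plays the role of the boost::bind expression in Listing 4.12.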

4.3.2.2 Storage

As mentioned above, it is the task of the client side implementation to store the whitelists and identifiers durably. The Akonadi agent uses an SQLite database, as this is part of the Qt libraries and therefore does not require further dependencies. The usage of a lightweight, in-process database is sufficient for the needs of Mail-Shake because the expected amount of data is rather small. The evaluation showed that even after several weeks of usage the database is still less than 10 KB in size.

Mail-Shake requires three tables: one for the permanent whitelist, one for the temporary whitelist and one for the identifiers. These three tables are illustrated in Table 4.3. The database is created at startup if it does not yet exist. Also at startup, the tables are read and passed to the library. The tables are only modified after a signal is emitted from the library, as seen in Listing 4.13.

A special case is a scheduled cleanup of old Ids and temporary whitelist entries. The agent offers a configuration option to remove issued Ids after several days. The idea behind it is that each junk mail sent to the public address results in an added identifier, but this identifier will most likely never be removed. So the data storage grows constantly, and therefore the possibility to remove old Ids is introduced.

(a) Table Id Storage
Column     Data Type
id         Integer (Primary Key)
address    Varchar (100)
date       Datetime

(b) Table structure for permanent and temporary whitelist
Column     Data Type
id         Integer (Primary Key, auto increment)
header     Varchar (150)
filter     Varchar (150)
type       Integer
date       Datetime

Table 4.3: Database structure of Mail-Shake agent
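The scheduled cleanup of old Ids amounts to removing every row whose date is older than the configured number of days. The sketch below shows this on an in-memory copy of the Id table instead of the SQLite database; the struct IdRow and the function removeExpiredIds are hypothetical, and the date is simplified to a std::time_t value.

```cpp
#include <algorithm>
#include <cstddef>
#include <ctime>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of one row of the Id table.
struct IdRow {
    int id;
    std::string address;
    std::time_t date;  // time at which the Id was issued
};

// Removes all Ids issued more than maxAgeDays before now and returns the
// number of removed rows.
std::size_t removeExpiredIds(std::vector<IdRow> &rows, std::time_t now, int maxAgeDays) {
    const std::time_t maxAge = static_cast<std::time_t>(maxAgeDays) * 24 * 60 * 60;
    const std::size_t before = rows.size();
    rows.erase(std::remove_if(rows.begin(), rows.end(),
                              [now, maxAge](const IdRow &row) {
                                  return now - row.date > maxAge;
                              }),
               rows.end());
    return before - rows.size();
}
```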

4.3.2.3 Sending and Receiving Mails

Akonadi uses the concept of Collections: each mail folder is represented as a collection and each mail, an Item for Akonadi, is part of a collection. Mail-Shake receives mails for a private and a public address and needs to be informed about sent mails. The agent assumes that mails for private and public addresses do not end up in the same collection. Nevertheless it is possible that several collections are used for the private addresses, e.g. in case of filtering. The agent is informed by Akonadi whenever a new mail arrives. In order to reduce the number of wakeups, the agent can restrict these notifications to the collections it is interested in.

    if (m_publicCollections.contains(collection.id())) {
        if (item.hasFlag("\\Seen")) {
            // not interested in seen messages
            return;
        }
        ItemFetchJob *job = new ItemFetchJob(item);
        job->fetchScope().fetchFullPayload();
        connect(job, SIGNAL(itemsReceived(Akonadi::Item::List)),
                this, SLOT(publicItemsReceived(Akonadi::Item::List)));
        job->start();
    }

Listing 4.15: Feting a mail sent to the public address When a new Item is added the agent has to e if it is for the private or the public collection. If it is a mail Mail-Shake is interested in, the agent fetes the complete mail in an asynronous manner as illustrated in Listing 4.15. As soon as the mail is retrieved a signal is emied and the slot is invoked. Of course the mail is not in the right format to be passed to the Mail-Shake library as it is rep- resented as a KMime message. erefore the message has to be transformed into an Mail-Shake 72 4 Development of the Systems

EMail object to be passed to the library for further processing. Mail-Shake only operates on the mail headers, the body does not maer. So ea required header as for example the subject has to be extracted from the KMime message as illustrated in Listing 4.16 and passed to the mating seer method of the EMail class. if (Headers::Subject *subject = message->subject(false)) { receivedMail->setSubject(subject->as7BitString(false).constData()); } if (Headers::Base *id = message->headerByType("X-Mailshake-Response-ID")) { receivedMail->setChallengeResponseId(id->as7BitString(false).constData()); } if (Headers::Base *id = message->headerByType("X-Mailshake-ID")) { receivedMail->setChallengeId(id->as7BitString(false).constData()); }

Listing 4.16: Extracting headers from a KMime message

The conversion of a mail must also consider delivery status notifications. Due to limitations of the KMime library this is not a trivial issue. The API itself is straightforward: all message parts are modeled as a “Content” and provide a getter for their MIME type. So the idea is to test if the MIME type of the message is “multipart/report”, to check all Contents for one with MIME type “message/rfc822”, and to parse this attached mail to test if it has headers set by Mail-Shake. Unfortunately KMime fails at parsing the multipart message and writes the following message to stderr:

    kdepimlibs (kmime) KMime::Content::parse: Failed to parse multipart. Treating as text/plain.

This implies that a delivery status notification is treated as a normal mail without even recognizing the attached mail. Mail-Shake processes a delivery status notification just like any other mail, including sending challenges and notifications, which is not the desired behavior. Because of that, the implementation of the Akonadi agent tries to work around this issue by parsing the mail itself, accessing the raw content before it is parsed by KMime. While this workaround solved the issue during development, it did not work in a productive environment during the evaluation. There, the KMime message is already parsed before being passed to the agent, and accessing the raw content fails as the correct content is only returned if the message has not been parsed. The reason for this problem might be that KMime is unable to handle encapsulated messages. A patch for this issue, to be included in the next version of the KDE development platform, has been proposed[43].

The agent is also responsible for sending out Mail-Shake challenge mails and notifications. Therefore a Mail-Shake EMail has to be converted into a KMime message, that is, the opposite of the conversion discussed above. This requires generating the appropriate mail headers, which are listed in Table 4.4, as well as the body. The body itself is completely configurable by the user, which makes it possible to write custom messages in different languages. The used template provides some tokens for adding the challenge Id and URL to the body. An example template for the Mail-Shake challenge is provided in Figure 4.12.

Header             Value
To                 Sender of received mail
From               configurable
In-Reply-To        Message-id of received mail
User-Agent         MailShake Akonadi Agent v0.0.1
Date               Current date
Subject            configurable
Message-id         generated
Content-Type       text/plain
MIME-Version       1.0
X-Mailshake-ID     generated id (only for challenge mails)
X-Mailshake-URL    configurable

Table 4.4: Mail headers used in Mail-Shake challenge and notification mails

You sent an email to a public Mail-Shake address. The email will not be delivered. You have to send the email to the private address. You can retrieve the private address by visiting the following web address and solving the shown CAPTCHA:

<%CAPTCHA_URL%>

The email to the private address will only be delivered if you include the following text in the subject of your email:

Mail-Shake Id: <%MAILSHAKE_ID%>

In future you can send emails directly to the private email address as normal. If you did not send an email to the public address you can ignore this email.

This message was generated automatically. Please do not respond.

Figure 4.12: Template of a Mail-Shake challenge
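Filling the template of Figure 4.12 is a plain token substitution. The following sketch shows one way to do it with std::string; the function names replaceAll and fillChallengeTemplate are hypothetical, only the two tokens <%CAPTCHA_URL%> and <%MAILSHAKE_ID%> are taken from the figure, and the agent itself may well use Qt's string classes instead.

```cpp
#include <string>

// Replaces every occurrence of token in text by value.
std::string replaceAll(std::string text, const std::string &token, const std::string &value) {
    if (token.empty()) {
        return text;  // nothing to search for
    }
    std::string::size_type pos = 0;
    while ((pos = text.find(token, pos)) != std::string::npos) {
        text.replace(pos, token.size(), value);
        pos += value.size();  // continue behind the inserted value
    }
    return text;
}

// Fills the two tokens of the challenge template.
std::string fillChallengeTemplate(const std::string &body,
                                  const std::string &captchaUrl,
                                  const std::string &mailshakeId) {
    return replaceAll(replaceAll(body, "<%CAPTCHA_URL%>", captchaUrl),
                      "<%MAILSHAKE_ID%>", mailshakeId);
}
```

Continuing the search behind the inserted value ensures that a replacement which itself contains the token text is not expanded again.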

4.3.2.4 Configuring the Whitelist

A user of Mail-Shake must be able to manage the whitelist. He must be able to manually add entries to the whitelist for the communication with automated systems, and to delete entries in case junk mails are received from an address. The Akonadi agent must therefore offer a configuration interface for the whitelist, as illustrated in Figure 4.13(a). This configuration interface allows editing all details of an entry, including the date, the header, the filter and the matching strategy.

It is possible to add a new empty entry or to generate entries based on all mails in a given mail folder. In the latter case the agent looks through all mails in the folder, extracts all senders and presents them in a dialog, which is illustrated in Figure 4.13(b). The user can easily select all the addresses he wants to add to the whitelist. This is an important feature for the first setup of Mail-Shake, as it allows adding all existing contacts to the whitelist and removes the burden of authentication.

(a) Configure the whitelist
(b) Add addresses found in mail folder

Figure 4.13: Dialogs to configure the whitelist

A sender might be informed when a Mail-Shake user withdraws the address from the whitelist. In case junk mails are received from that address, it is possible that the sender’s computer is compromised by malware which got hold of the address book. Informing the sender about the possible infection is of course useful. Nevertheless it might not always be required to send a notification. If an address of an automated system is withdrawn from the whitelist, there is no need to send a notification. Privacy might also be a reason not to send a notification, in case the Mail-Shake user does not want the other party to know that the address is withdrawn. Therefore the configuration interface asks if a mail should be sent when an entry for an address is removed. If so, the user has the possibility to edit a predefined message and choose the sender address.
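The generation of whitelist candidates from a mail folder, as described above for the dialog in Figure 4.13(b), can be sketched as follows. This is an illustration under assumptions and not the agent's code: the function whitelistCandidates is hypothetical, and both the case insensitive removal of duplicates and the skipping of already whitelisted addresses are assumptions about sensible behavior which the text does not state.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Lower-cases an address for case insensitive comparison.
std::string lowered(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns each sender once, in order of first appearance, leaving out
// addresses which are already on the (lower-cased) whitelist.
std::vector<std::string> whitelistCandidates(const std::vector<std::string> &senders,
                                             const std::set<std::string> &whitelist) {
    std::set<std::string> seen;
    std::vector<std::string> result;
    for (const std::string &sender : senders) {
        const std::string key = lowered(sender);
        if (whitelist.count(key) || !seen.insert(key).second) {
            continue;  // already whitelisted or already offered
        }
        result.push_back(sender);
    }
    return result;
}
```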

4.3.2.5 Delaying Dropping of Mails

From a usability point of view the communication with automated systems is one of the most difficult topics. On the one hand it has to be very easy for the user of Mail-Shake to add an address to the whitelist. On the other hand Mail-Shake should stay in the background as much as possible; it has to be a background service which is not visible to the user. If it is not visible, the user has no chance to easily add an address. If it is visible, it is an annoying item which does nothing most of the time except telling the user that Mail-Shake is running.

Mail-Shake should only be visible when the user wants it to be; in all other cases it should not be visible to the user. Of course this is more or less an impossible task, as Mail-Shake cannot know when the user registered his address at a new web service and is expecting a mail whose address has to be added to the whitelist. This problem can be solved by introducing a processing delay. A mail which is not whitelisted is not discarded at once, but only after a user configurable timespan has elapsed. During this delay the user has the chance to add the whitelist entry.

Figure 4.14: Notification upon receipt of not whitelisted mail

While there is a processing delay, the user has to be informed that there is a delay and that he might need to access the configuration. Of course it is again important that Mail-Shake does not annoy the user by notifying him in a too prominent way, for example with a popup dialog. Therefore a status notifier is used, which is hidden by default and only becomes visible in the desktop panel as a small icon when there is a delay. Some of these notifiers are illustrated in Figure 6.1 on page 103. The status notifier allows opening the configuration to add a new entry to the whitelist. Status notifiers have been proposed by the KDE developer community as a specification for the free desktop[40]. The specification is available in Plasma as of version 4.4, and an implementation is being written by Canonical’s Desktop Experience Team to be shipped with the GNOME desktop of the upcoming Ubuntu 10.04 release. If the specification is not implemented in the user’s system, the notifier falls back to a more native variant which does not support the automatic hiding of the icon.

The status notifier is the solution for not annoying the user while at the same time providing the required information. If a user expects a mail, he just has to wait until the notifier becomes visible and add an entry for the address to the whitelist. In order to make this step as easy as possible, the agent displays the addresses of all mails whose processing is currently delayed. If the user is not expecting a mail, the appearance of the icon is not disturbing, and the icon will be hidden after the processing delay. If the user does not want to use a processing delay, he will not see the icon at all.
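The processing delay can be modeled as a list of pending mails which is re-evaluated from time to time. The following sketch is hypothetical and not the agent's implementation: the names DelayedMail, DelayResult and processDelayed are introduced for illustration, time is simplified to std::time_t, and the status notifier is assumed to be visible exactly as long as the pending list is not empty.

```cpp
#include <ctime>
#include <set>
#include <string>
#include <vector>

// A mail which is not whitelisted and waits for the delay to elapse.
struct DelayedMail {
    std::string sender;
    std::time_t received;
};

struct DelayResult {
    std::vector<std::string> delivered;  // senders whitelisted during the delay
    std::vector<std::string> dropped;    // senders whose delay elapsed
};

// Re-checks all delayed mails: a mail is delivered as soon as its sender is
// whitelisted, dropped once the delay has elapsed, and kept pending otherwise.
DelayResult processDelayed(std::vector<DelayedMail> &pending,
                           const std::set<std::string> &whitelist,
                           std::time_t now, std::time_t delaySeconds) {
    DelayResult result;
    std::vector<DelayedMail> stillPending;
    for (const DelayedMail &mail : pending) {
        if (whitelist.count(mail.sender)) {
            result.delivered.push_back(mail.sender);
        } else if (now - mail.received >= delaySeconds) {
            result.dropped.push_back(mail.sender);
        } else {
            stillPending.push_back(mail);
        }
    }
    pending.swap(stillPending);
    return result;
}
```

The senders remaining in the pending list are the addresses the agent would display to the user for adding to the whitelist.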

In order to make Mail-Shake even more usable when awaiting a new mail from an automated system, a notification is added. This notification is an optional feature and can be toggled in the context menu of the agent’s status notifier. When a new mail which is not whitelisted is received, a notification is shown as illustrated in Figure 4.14. It names the sender and offers a button to add the address directly to the whitelist. So the user can activate this notification as soon as he registers with a web service and disable it again after receiving the initial mail and adding the address to the whitelist. This solves the problem of being informative without being annoying in an elegant way.

4.3.3 Mail-Shake Integration in Email Clients

4.3.3.1 Integration into Mailody

To help the user solve a Mail-Shake challenge without having to click the link in the challenge mail, an integration into the email client is required. The client has to inform the user that the mail is a Mail-Shake challenge and offer a way to easily solve the challenge and resend the original mail without leaving the client. A nice to have feature would be to automatically update the address book entry.

As all software written in the scope of this thesis is built upon the Qt and KDE libraries, the KDE email client KMail is the obvious choice to implement this integration. KMail already offers such integration parts: for example, the result of Spam Assassin is displayed in the header section, and cryptographically signed mails are colored and provide interactions to view signature information. In the next version of KMail, which will be released as part of the KDE Software Compilation 4.5, the rendering engine of the email viewer will change from KHTML to its fork WebKit, and the MIME implementation will be changed to the KMime library.

For the integration, the parsing of the mail headers has to be adjusted to extract the X-Mailshake-URL header value, and the HTML representation of the mail has to be changed to provide the interaction. As exactly these two points will change in the next release, the integration would be broken again and the work would be lost. So integrating into KMail is not the preferred solution. With Mailody there exists another KDE email client, which is already ported to Akonadi, KMime and WebKit. An integration into Mailody will most likely allow reusing the code in KMail with minimal adjustments, as both applications will use the same technologies.

4.3.3.2 Modifying the Header Information Widget

    KMime::Headers::Base* mid = m_msg->headerByType( "X-Mailshake-ID" );
    if ( mid )
        m_mailshakeID = mid->asUnicodeString();

    KMime::Headers::Base* murl = m_msg->headerByType( "X-Mailshake-URL" );
    if ( murl )
        m_mailshakeUrl = murl->asUnicodeString();

Listing 4.17: Extracting the Mail-Shake headers in Mailody

Mailody caches the data it displays in a class called MessageData, which parses the header information in the method setMessage(KMime::Message* msg). This method has to be extended by code to extract the Mail-Shake challenge id and the Mail-Shake URL, as shown in Listing 4.17. If a header does not exist, the method headerByType() returns a null pointer, so a simple test of the pointer is sufficient to verify that the header is present. In that case the header value is stored in a string attribute, and the class is extended by appropriate getter methods.

Figure 4.15: Mailody Message View with Mail-Shake challenge mail integration

    const QString mailshakeId = m_currentMessage->mailshakeID().trimmed();
    const QString mailshakeUrl = m_currentMessage->mailshakeUrl().trimmed();
    if ( !mailshakeId.isEmpty() && !mailshakeUrl.isEmpty() ) {
        const QString url( "mailshake:" + mailshakeId + '?' + mailshakeUrl );
        // the HTML markup inside the string literals is not reproduced here
        text.append( "" + i18n( "This is a MailShake challenge. "
                                "Your previous mail has not been delivered."
                                "To authenticate your address click "
                                "here", url ) + "" );
    }

Listing 4.18: Displaying Mail-Shake challenge information in Mailody’s header widget

If the Mail-Shake headers are set, the mail viewer can be altered to inform the user about the Mail-Shake challenge and offer to solve it. Mailody has a section which shows some mail header information such as subject, recipient and sender. The Mail-Shake information can be added to this section, as illustrated in Figure 4.15. The advantage is that it is clear to the user that the provided hyperlink is part of the email client and not part of the mail.

Mailody constructs a simple HTML table for the header in the class MessageHeaderWidget. Here the Mail-Shake information can be added as another row, as illustrated in Listing 4.18. The scheme of the constructed hyperlink is set to mailshake:/ so that a click on it does not open an external web browser. The click is already intercepted in the method slotLeftMouseClick() in order to open a message composer when a mail address is clicked and to open attachments in an external application. As illustrated in Listing 4.19, this code is extended to open a dialog for solving the Mail-Shake challenge.

File                       Added Source Code Lines
messagedata.h              11
messagedata.cpp             8
messageheaderwidget.cpp    17

Table 4.5: Changed files for Mail-Shake challenge integration in Mailody

    void MessageHeaderWidget::slotLeftMouseClick( const QUrl& to )
    {
        if ( !m_currentMessage )
            return;

        if ( to.scheme().compare( "mailto", Qt::CaseSensitive ) == 0 ) {
            // mail address clicked: open a message composer
            emit openComposer( to.path() );
        } else if ( to.scheme().compare( "attachment", Qt::CaseSensitive ) == 0 ) {
            // attachment clicked: open it in an external application
            KRun *run = new KRun( to.path(), this );
            run->setRunExecutables( false );
        } else if ( to.scheme().compare( "mailshake", Qt::CaseSensitive ) == 0 ) {
            // Mail-Shake link clicked: open the dialog for solving the challenge
            QPointer<MailShakeDialog> dialog = new MailShakeDialog( to, this );
            dialog->setWindowTitle( i18n("MailShake Challenge") );
            dialog->exec();
            delete dialog;
        }
    }

Listing 4.19: Intercepting a click on a link in order to open the Mail-Shake challenge dialog

The MailShakeDialog, which is opened by clicking the link, is independent of Mailody and could in fact be moved into a small library of its own. The presented code fragments could be changed to a conditional build, so that they are only built if the library is present. Altogether the changes to the source code of the email client are rather small. The number of lines added to the various code files, not counting the files for the dialog, is listed in Table 4.5.
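The hyperlink constructed in Listing 4.18 has the form mailshake:Id?URL, so the receiving side has to split it into its two parts again. How the MailShakeDialog does this is not shown in the listings; the following sketch, with the hypothetical names ChallengeLink and parseChallengeLink, illustrates one possible way using std::string instead of QUrl.

```cpp
#include <string>

// Result of splitting a Mail-Shake link.
struct ChallengeLink {
    bool valid;       // true if both parts were found
    std::string id;   // the Mail-Shake challenge Id
    std::string url;  // the URL of the CAPTCHA page
};

// Splits "mailshake:<id>?<url>" at the first question mark after the scheme.
// Later question marks belong to the challenge URL itself.
ChallengeLink parseChallengeLink(const std::string &link) {
    ChallengeLink result = {false, "", ""};
    const std::string scheme = "mailshake:";
    if (link.compare(0, scheme.size(), scheme) != 0) {
        return result;  // different scheme, e.g. mailto or attachment
    }
    const std::string::size_type separator = link.find('?', scheme.size());
    if (separator == std::string::npos) {
        return result;  // no challenge URL present
    }
    result.id = link.substr(scheme.size(), separator - scheme.size());
    result.url = link.substr(separator + 1);
    result.valid = !result.id.empty() && !result.url.empty();
    return result;
}
```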

4.3.3.3 Dialog for Solving the CAPTCHA

The task of the MailShakeDialog, whose source code is listed in Appendix F, is to display the CAPTCHA to the user, validate it and finally present the private address. This has two advantages:

1. The challenge is presented in a secure way as part of the client.

2. The challenge can be translated even if the web based CAPTCHA does not offer translations.

The biggest disadvantage is that the dialog breaks as soon as the provider changes the web page. The CAPTCHA has to be extracted from the page, so if the structure changes, the CAPTCHA can no longer be extracted. Although this is unlikely to happen, the scr.im integration, for example, will break as soon as the CAPTCHA becomes secure (see Chapter 3.2). During the last year the KDE community proved that it is possible to embed changing web content in applications by using a plugin architecture[36]. There are also several web plugins written in ECMAScript which are executed embedded in the application. If the web page changes, only the scripted plugin has to be changed. This sounds like a reasonable approach for the Mail-Shake integration into email clients as well.

(a) reCAPTCHA dialog with error message
(b) scr.im dialog

Figure 4.16: Dialogs to solve the Mail-Shake challenge

    QWebElement img = frame->findFirstElement("td.recaptcha_image_cell div img");
    if (img.hasAttribute("src") &&
        img.attribute("src").startsWith("http://api.recaptcha.net/image")) {
        // render the CAPTCHA image into an off screen pixmap
        const int width = img.attribute("width").toInt();
        const int height = img.attribute("height").toInt();
        QPixmap pixmap(width, height);
        pixmap.fill(Qt::transparent);
        QPainter painter(&pixmap);
        img.render(&painter);
        m_captchaWidget->setImage(pixmap);
    }

    QWebElement hidden = frame->findFirstElement("div.recaptcha_input_area span input");
    if (hidden.attribute("id").startsWith("recaptcha_challenge_field")) {
        // parameter required for submitting the solution
        m_captchaParameter = hidden.attribute("value");
        return true;
    }

Listing 4.20: Extracting CAPTCHA from the reCAPTCHA web page

The dialog has to download the web page specified by the Mail-Shake challenge URL and pass it to the specific implementation for reCAPTCHA or scr.im. These two possible dialogs are illustrated in Figure 4.16. The interaction with the HTTP servers is implemented using Qt’s WebKit library. This library allows loading a web page in the background. As soon as the page is loaded, a signal is emitted, which can be used to operate on the page, extract some elements from the DOM⁷ tree and render them into an off screen image. This is illustrated in Listing 4.20 for the reCAPTCHA service. The CAPTCHA is rendered into a pixmap, and an attribute required for solving the CAPTCHA is extracted and stored as well.

    // test if we have the address
    QWebElement address = frame->findFirstElement("div center b a");
    if (address.hasAttribute("href") &&
        address.attribute("href").startsWith("mailto:")) {
        m_address = address.toPlainText();
        KMessageBox::information(this, m_address);
        emit captchaSolved();
        return true;
    }

Listing 4.21: Testing if the web page contains the revealed mail address

As soon as the user solves the challenge, the POST message has to be constructed and sent to the server. Based on the result, the CAPTCHA-specific implementation has to decide whether the challenge is solved or not. While scr.im shows an error page, reCAPTCHA just displays a new CAPTCHA. This has to be recognized so that a localized error message can be shown (see Figure 4.16(a)). If the challenge is solved correctly, the email address has to be extracted and presented to the user, and the dialog can be closed. This is illustrated in Listing 4.21.

⁷Document Object Model

4.4 Implementation of Spam Templates

In this Section the implementation of the client side part of the Spam Templates approach is discussed. The implementation is similar to the one of Mail-Shake. It follows its ideas and shares as much code as possible and useful. The aspects which were already discussed in Section 4.3 are not discussed again in this Section. This includes the complete aspect of the Akonadi Agent and the communication with the Akonadi framework. The main aspect of the discussion in this Section is testing whether a given mail is junk based on the template. Another aspect is fetching new templates from the RSS feed and generating the RSS feed with a small application.

4.4.1 Generating the RSS Feed

Although it is not required for the client side implementation of Spam Templates, a generator for an RSS feed of templates has also been implemented. The idea is that the client can fetch new templates by downloading an RSS feed. This gives the client an easy way to test if there are new templates. On the server side it requires that a newly generated template is added to the feed. Therefore a small command line application is implemented which can be invoked directly from the template generation process. The tool accepts one or several template files as arguments and generates a new feed or attaches the templates to the existing feed. All parameters of the tool are listed in Table 4.6.

Option            Usage
--help            Shows the help
--dir=            Generate RSS for all files in specified directory
--templatefile=   Generate RSS for specified template file
--outputfile=     RSS file, default templates.
--rss=            Append to the RSS file instead of generating a new one

Table 4.6: Command line options for the RSS generation tool

The content of each template file is read and a unique hash value is generated using Qt’s qHash() function. This value is used as the globally unique identifier for the entry in the feed and for the title. The hash value ensures that each template is only appended once to the feed. It can be used by the clients which retrieve the feed to identify new templates.

The code section illustrated in Listing 4.22 reads one template file, generates the hash value and constructs one item for the RSS document. All mandatory elements are set, including the link element. As the code illustrates, this is set to an invalid value. The element should point to the template which is included in the feed item. This implies that each template has to be stored on the web server together with the RSS feed and that the correct value has to be set. But as the web server and the template generation process do not yet exist, this cannot be set to a valid link. That code section has to be adjusted when the server component is finished and ready for productive usage. The template itself is encoded in the description tag.

    const QByteArray content = file.readAll();
    file.close();
    const QString hash = QString::number(qHash(content));
    if (containsGUID(hash)) {
        // template is already part of the feed
        return true;
    }
    QString xml;
    QXmlStreamWriter stream(&xml);
    stream.setAutoFormatting(true);
    stream.writeStartElement("item");
    stream.writeTextElement("title", hash);
    stream.writeTextElement("link", "http://foo.bar.baz/");
    stream.writeTextElement("description", content);
    stream.writeTextElement("guid", hash);
    stream.writeEndElement(); // item

Listing 4.22: Generating an RSS item from one template file

The complete generated RSS feed contains some additional header elements as well. An example of a generated feed with one template is presented in Listing 4.23. There is another link element which should point to the base address of the feed. The value of that element is invalid. The complete source code of this small tool is provided in Appendix G.

    Spam Templates
    http://foobar.baz/
    RSS feed containing the Spam Templates for Spam filtering
    Wed, 25 Nov 2009 10:12:37 UTC
    152154794
    http://foo.bar.baz/
    Subject\:\ ([\/\@\$\_\=\|\-\,\!\:\s\w]){2,112}\.([\.\,\=\$\;\?\-\:\s\w]){0,91}\ Body\:\ ([\,\;\&\#\=\!\s\w]){1,1296}\ ([\,\.\;\(\)\?\-\:\’\&\#\s\w]){4,69}\ \=\
    152154794

Listing 4.23: Generated RSS feed containing one template

Figure 4.17: Configuration for determining the matching score

4.4.2 Testing a Mail

The most important part of Spam Templates is of course to test if a mail matches one of the templates. Each template consists of several regular expressions and each regular expression represents one line of a mail. Additionally there are regular expressions for the subject and the X-Mailer header. This matching approach requires quite some resources. Each line of the mail has to be matched against the corresponding regular expression of the template. This has to be continued until one template identifies the mail as junk or the mail has been matched against all templates. As a consequence, the time to test if a mail is junk increases with each new template. Therefore it might be an idea to withdraw obsolete templates. If the spam bots start to use an old template again, it can be published as a new one.

The matching algorithm determines a score for the tested mail between 0 (no match) and 100 (complete match). This score can be used to do fuzzy matches, so that mails which are not matched completely are also classified as junk. The following values are used to construct the score:

• Subject

• X-Mailer

• Empty lines

• Matched lines

• Sequence of matched lines/Number of lines in mail

• Sequence of matched lines/Number of lines in template

As illustrated in Figure 4.17 the user can define how much of the score is determined by each of these factors. So for example the user could change the matching in a way that only the number of matched lines is used, ignoring all other matching values.

While the quantification for subject and X-Mailer is rather obvious, the need for the other factors has to be explained. The idea behind empty lines is to completely ignore empty lines while matching a mail. A legitimate mail might have several empty lines to structure the text. So it is possible that a mail of ten lines has three empty lines. If there is a template with empty lines at the exact same positions, the number of matching lines increases. In this example 30 % of the mail is matching and that could result in a false positive. Completely removing the empty lines is no option, because they might be another indicator of whether a spam mail matches a given template. In case the first half of a template is a perfect match while the second half is not, the mail is perhaps not identified as junk. But if in the first half of the template the number of empty lines matches the number of empty lines in the mail, it is more likely that the mail is in fact junk. Therefore the number of empty lines in the mail is compared to the number of empty lines in the template.

The next factor in the list given above is probably the most important one: the number of lines in the email matching those in the template. The matching lines might be in arbitrary order. So it is possible that the first line of the mail matches the last line of the template and the last line of the mail matches the first line of the template. The algorithm tries to match as many lines as possible, which motivates the next two items: the maximum sequence of matched lines in a row. Although the algorithm tries to match as many lines as possible, it also tries to match as many subsequent lines as possible. Only if the next line does not match does the algorithm start to skip lines, and if none of the following lines of the template matches the current line of the email, it falls back to the beginning of the template.
So it is possible that, although the number of matching lines is identical to the number of lines of the mail and the template, they do not match at all. This problem is circumvented by using the weight of the sequence of matched lines. A perfect match is a match of all lines in sequence. The algorithm needs to compare the sequence of matched lines to both the number of lines in the mail and the number of lines in the template. In case the sequence is identical to the number of lines in the mail but the template has more lines, there is not an identical match and the mail might not be a junk mail. The same is true for the case that the sequence matches the number of lines in the template but the mail has more lines. This adds another condition to the perfect match: it is a match of all lines in sequence with the number of lines in the template and the mail being identical. Of course the number of empty lines, the subject and the mailer should match, too.

The algorithm as presented so far is provided in Listing 4.24 (the complete algorithm can be found in Appendix H). The for-each loop iterates over all lines of the mail body and uses an iterator to index one line of the spam template. At the beginning both point to the first line of mail and template. When the current line of the mail is empty, the line is skipped and the counter for empty lines is incremented. If it is not empty, it has to be compared to the regular expression identified by the current line of the template. In case the regular expression matches the line, the number of matches and the sequence of matches are incremented and the iterator is moved to the next regular expression. As it is possible that the iterator is moved to the end of the template, this has to be checked, and in that case the iterator is reset to the first regular expression of the template. Of course the sequence breaks and is reset.
As there was a match, the processing of the current line can be stopped and the algorithm can continue with the next mail line.

    // try to match as many lines as possible in one go
    std::list<boost::regex>::const_iterator it = d->body.begin();
    BOOST_FOREACH (const std::string &line, mail.body()) {
        if (line.empty()) {
            // skip empty lines
            ++emptyLineCounter;
            continue;
        }
        ++lineCount;
        // does the current iterator match the line?
        if (boost::regex_match(line, *it)) {
            // matches
            ++matchCount;
            ++currentMaxMatches;
            maxMatches = max(maxMatches, currentMaxMatches);
            ++it;
            if (it == d->body.end()) {
                // back to begin of template
                it = d->body.begin();
                currentMaxMatches = 0;
            }
            continue;
        }
        // so far no match - reset to beginning
        currentMaxMatches = 0;
        it = d->body.begin();
        while (it != d->body.end()) {
            if (boost::regex_match(line, *it)) {
                ++matchCount;
                ++currentMaxMatches;
                maxMatches = max(maxMatches, currentMaxMatches);
                ++it;
                if (it == d->body.end()) {
                    // back to begin of template
                    it = d->body.begin();
                    currentMaxMatches = 0;
                }
                break;
            }
            ++it;
        }
        if (it == d->body.end()) {
            // no line matched - reset
            it = d->body.begin();
        }
    }

Listing 4.24: Algorithm for matching a mail body against a template

In case the regular expression does not match the current line, the algorithm tries to find a different regular expression that matches the current line. Therefore the presented algorithm resets the iterator to the beginning (and consequently the sequence of matched lines) and starts to loop over all regular expressions to find a match. This implies that in the worst case of no match at all, each line is matched against each regular expression of the template, resulting in a worst case complexity of O(n · m) ≈ O(n²), without considering the complexity of regular expression matching. The best case scenario has a linear complexity, as each line of the mail is matched against exactly one regular expression. Altogether, testing whether a mail is spam with the Spam Templates approach is of cubic complexity, as a legitimate mail has to be tested against all templates.

    // doesn't match - try forward seek
    std::list<boost::regex>::const_iterator tit = it;
    ++tit;
    bool foundMatch = false;
    // i counts the inspected template lines and is bounded by lineSeek
    for (int i=0; i<lineSeek && tit != d->body.end(); ++i, ++tit) {
        if (boost::regex_match(line, *tit)) {
            // matches
            foundMatch = true;
            break;
        }
    }
    if (!foundMatch) {
        // did not find a match in forward seek - try backward seek
        tit = it;
        for (int i=0; i<lineSeek && tit != d->body.begin(); ++i, --tit) {
            if (boost::regex_match(line, *tit)) {
                // matches
                foundMatch = true;
                break;
            }
        }
    }
    if (foundMatch) {
        ++matchCount;
        ++currentMaxMatches;
        maxMatches = max(maxMatches, currentMaxMatches);
        ++it;
        if (it == d->body.end()) {
            // back to begin of template
            it = d->body.begin();
            currentMaxMatches = 0;
        }
        continue;
    }

Listing 4.25: Forward and backward search for a matching line based on fuzziness
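The complexity estimate given for the matching algorithm can be written out as follows. Here n is the number of non-empty lines of the mail and m the number of regular expressions in one template, as in the text; the symbol t for the number of templates is introduced only for this sketch. The cost of a single regular expression match is ignored, as in the estimate above.

```latex
\begin{align*}
  T_{\text{best}}(n)      &= O(n)
      && \text{each mail line matches the expected expression} \\
  T_{\text{worst}}(n, m)  &= O(n \cdot m)
      && \text{no line matches, every expression is tried} \\
  T_{\text{all}}(n, m, t) &= O(t \cdot n \cdot m)
      && \text{a legitimate mail is tested against all } t \text{ templates}
\end{align*}
```

The last bound reads as cubic only under the additional assumption that n, m and t are of comparable size. For a fixed set of templates the cost grows linearly with the length of the mail.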

There is a small difference between the presented algorithm and the discussion before. When a line does not match the current line in the template, the algorithm goes back to the first line of the template. In fact a part of the algorithm has been omitted. The algorithm tries to match swapped lines as well. In case the spam mail swaps lines or omits lines present in the template, the maximum sequence of matched lines becomes smaller. This might result in false negatives. Therefore a configurable fuzziness is introduced. So when a line of the mail body does not match the current line in the template, the algorithm does a forward and backward seek upon the lines of the template to find a matching line. If such a line is found, it is counted as a line in sequence, although it is not the next line. The fuzzy matching is illustrated in Listing 4.25; the variable lineSeek is based on the configurable fuzzy factor. If the user selects no fuzzy matching it is 0; if completely fuzzy is selected it is the number of lines in the template. So with a completely fuzzy matching strategy the algorithm first seeks till the end of the template to find a match and then continues by going back to the beginning.

Finally the gathered information is used to determine the score by multiplying the configurable factors with the data, like the number of matched lines compared to the number of lines. This results in a number in the range between 0 (no match) and 100 (perfect match). In case this score is greater than a defined threshold, the mail is classified as junk and further tests against other templates do not have to be performed. The mail can then be deleted or moved to a special mail folder, just like filtered mails in Mail-Shake.

The configuration interface of the Spam Templates Agent allows changing the quantifiers for the matching algorithm. To help the user there is a test functionality which shows how many mails are classified as junk with the current settings. This is also a helpful feature to find the best predefined values when the server component becomes available and new templates are generated.
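The score computation described above can be sketched as follows. This is a minimal illustration and not the code from Appendix H: the types MatchStats and Weights, their field names, the convention that the weights sum to 100 and the way the agreement of empty lines is measured are all assumptions made for this example.

```cpp
#include <algorithm>

// Hypothetical container for the counters gathered while matching one
// mail against one template (names are assumptions of this sketch).
struct MatchStats {
    bool subjectMatches;
    bool mailerMatches;
    int emptyLinesMail;
    int emptyLinesTemplate;
    int matchCount;     // matched non-empty lines
    int maxSequence;    // longest run of lines matched in sequence
    int mailLines;      // non-empty lines in the mail
    int templateLines;  // regular expressions in the template
};

// Hypothetical user-defined weights; assumed to sum to 100.
struct Weights {
    double subject;
    double mailer;
    double emptyLines;
    double matchedLines;
    double sequencePerMail;
    double sequencePerTemplate;
};

// Fraction part/whole clamped to [0, 1]; 0 if whole is not positive.
static double ratio(int part, int whole)
{
    if (whole <= 0) {
        return 0.0;
    }
    return std::min(1.0, static_cast<double>(part) / whole);
}

// Weighted sum of the six factors, in the range 0 (no match) to 100.
double score(const MatchStats &s, const Weights &w)
{
    // Assumption: empty lines agree to the degree smaller/larger count.
    const int fewer = std::min(s.emptyLinesMail, s.emptyLinesTemplate);
    const int more = std::max(s.emptyLinesMail, s.emptyLinesTemplate);
    const double emptyAgreement =
        (more == 0) ? 1.0 : static_cast<double>(fewer) / more;

    return w.subject * (s.subjectMatches ? 1.0 : 0.0)
         + w.mailer * (s.mailerMatches ? 1.0 : 0.0)
         + w.emptyLines * emptyAgreement
         + w.matchedLines * ratio(s.matchCount, s.mailLines)
         + w.sequencePerMail * ratio(s.maxSequence, s.mailLines)
         + w.sequencePerTemplate * ratio(s.maxSequence, s.templateLines);
}
```

A mail would then be classified as junk when `score(stats, weights)` exceeds the configured threshold. Setting all weights except `matchedLines` to zero corresponds to the example of using only the number of matched lines.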

4.4.3 Summary

In this Section the implementation of the Spam Templates approach has been discussed. First of all publishing the templates as an RSS feed has been shown, followed by a detailed discussion of the algorithm which matches a mail against a template. The algorithm tries to find as many matching lines as possible and generates a score based on the number of matching lines, the sequence of matching lines, subject, X-Mailer and the comparison of empty lines in mail and template. There is also a fuzzy part which tries to find a matching line by a forward or backward seek on the lines of the template.


5 Evaluation

In this Chapter an evaluation of the implemented software is discussed. Especially for Mail-Shake it is important to know how possible users are able to work with the system and if the concept is able to eliminate spam as expected.

It would be nice to evaluate if the Spam Templates are able to recognize spam mails even if the mail is not recognized by a rule based system like Spam Assassin. Unfortunately this cannot be evaluated, as it would require a running server component which generates new templates and publishes them as an RSS feed. Such an environment does not currently exist. Of course there is the possibility to test the Spam Templates implementation with a set of prepared templates on existing mails. This shows if the implementation works as expected, but does not say anything about the usefulness in a productive environment. Nevertheless it is possible to use the Spam Templates Agent to monitor incoming mails. Given the outdated templates it is expected that no junk mails are found. The Spam Assassin installation in use would recognize the mails based on old templates and thereby modify the mail in such a way that the templates do not match any more. Given these current limitations no evaluation for Spam Templates can be provided.

The evaluation of Mail-Shake concentrates on the concept, to figure out if Mail-Shake is able to reduce the amount of junk emails in a better way than existing systems. It is important to know if Mail-Shake generates false positives or negatives. Another important aspect for the evaluation is to test how the average user reacts to mails sent by Mail-Shake. Will they follow the steps to authenticate or will they not resend the mail? This is actually quite difficult to test. Sending out mails and waiting for a reply does not work, as the receiver of a mail is set on the temporary whitelist. Asking people to test it by sending mails does not work either, because they would be made aware of the test and would therefore react in a different way.

5.1 Mail-Shake Evaluation Setup

Mail-Shake has been used in a productive environment on an IMAP account beginning on the 31st of December 2009. The results presented here were gathered over a period of six weeks till the 15th of February 2010. As a special situation the used IMAP account gathers mails for different addresses, and the mails are filtered into sub-folders by a server side Sieve script based on either the receiver address or some email headers like a mailing list identifier. All emails received in this IMAP account have been processed by Spam Assassin, and based on the result all spam mails are moved into a junk sub-folder. Greylisting is used as well to reduce the number of spam mails.

Private Address    Public Address
privat@            public@
studium@           studium-public@
internet@          internet-public@
ubuntu@            ubuntu-public@
kde@               kde-public@

Table 5.1: Private and public addresses used during the Mail-Shake evaluation. ¹

For the test period greylisting has been turned off and the Sieve script was changed so that the spam mails are not moved to the junk folder directly. The script filters the mails into a separate folder for each used pair of private and public address. The existing rules for mailing lists are still in use, so that those mails do not end up in one of the folders monitored by Mail-Shake. Mails with an incorrect or missing “To” or “CC” header are not filtered and are moved to the junk folder in case such a mail is recognized as junk by Spam Assassin.

As a matter of fact all existing email addresses have to become private addresses. Normally the whitelist would have to be filled with existing addresses. For this evaluation it is important to see how people react to notification mails. As it is possible that they do not react, the mails must not be deleted but are moved to a folder, which has to be monitored manually.

The number of private mails sent to the addresses shown in Table 5.1 is rather small. Most mails are in fact sent by mailing lists or automated systems and do not appear in the evaluated folders. The addresses used for KDE development are part of the evaluation in the hope that users send mails, as the KDE Software Compilation in version 4.4 has been released during the evaluation period.

5.2 Results of Mail-Shake Evaluation

One of the first results of the evaluation is rather unpleasant. Although the receiving MTA² is configured to not accept mails sent from the domains in its responsibility by enforcing a sender access restriction, some of the spam mails received have a “From” header with the recipient’s address set. This is only possible if the spam sender uses a different address in the “Mail From” section of the envelope when talking to the MTA. This could be verified in the log file. The headers of the mail are unfortunately not logged.

¹All addresses use domain “martin-graesslin.com”
²Postfix server version 2.5.4

For Mail-Shake this unexpected behavior of spam mails is a real problem. Of course the mail is not whitelisted and is dropped, but a notification is sent. This notification is sent to the sender, which in fact is the private address. So the notification ends up as a not whitelisted mail in the private collection, causing another notification and so on. Mail-Shake is stuck in a mail sending loop till the user intervenes and marks one mail as read or deletes it. In the first month of the evaluation 38.7 % (98 mails) of the junk mails filtered by Mail-Shake and received for the “internet” address used the recipient address as the sender address.

To circumvent this mail loop the Mail-Shake implementation has been modified to not send notifications to its own addresses. Nevertheless this does not really fix the problem. If the user adds his own address to the whitelist, these spam mails are not removed but treated as whitelisted. This means that a user is not allowed to add his own address to the whitelist and cannot send mails to his own address, which is a severe limitation caused by the usage of Mail-Shake.

If the sender address of a junk mail is not the own address, it is in general a faked address. The notification cannot be delivered and, as expected, a Delivery Status Notification (DSN) is returned. Although the RFC does not require the original mail to be included, practice shows that it is almost always included. Thereby it can be checked if the original mail is a Mail-Shake notification mail, and the DSN can be dropped automatically in that case.
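The guard against the notification loop described above could look roughly like the following sketch. It is not taken from the Mail-Shake sources: the function name, the use of a std::set for the user's own addresses and the case-insensitive comparison of the complete address are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Lowercase copy of an address, so that "Privat@Example.org" and
// "privat@example.org" compare equal (a simplification of this sketch).
static std::string lowered(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns false if the sender is one of the user's own private or public
// addresses. ownAddresses is expected to contain lowercase addresses.
// Notifying such a sender would deliver the notification into the
// monitored folder again and trigger the next notification.
bool shouldSendNotification(const std::string &sender,
                            const std::set<std::string> &ownAddresses)
{
    return ownAddresses.count(lowered(sender)) == 0;
}
```

Such a check only suppresses the notification. The mail itself is still dropped, which is why the limitation regarding whitelisting the own address remains.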
During the evaluation delivery failure notifications were received which do not comply with RFC 3464. The content-type “multipart/report” is not set, and the original mail is not attached but pasted into the body of the mail. These notifications were sent by an Exim MTA in version 4.69, released in December 2007, and by qmail. Unfortunately it seems that Exim is, even in the latest release, unable to send compliant notifications[22, 64]. An example of such a notification is provided in Appendix A.2. The only way to recognize a delivery notification sent by Exim is to rely on the presence of the header “X-Failed-Recipients” and to search the complete body for an indication that it is a reply to a Mail-Shake mail. This is of course quite fuzzy, but not eliminating these Exim notifications results in another mail loop. In the scope of this thesis only a rudimentary check for this failing MTA has been implemented, which does not extract the originally sent message. The notifications of qmail are even more broken. The only possibility to guess that it is a notification is the subject “failure notice”, which might also be a valid subject of a legitimate mail.

The evaluation showed that Mail-Shake is able to reduce the number of junk mails received by this setup. The risk that spam bots are aware of a Mail-Shake setup and try to solve the challenge is rather low at the current time. The addresses used to send junk emails are either the recipient’s address or a faked one. So the original sender will never see that there is a Mail-Shake setup. In order to successfully send junk mails the spammers would have to change their infrastructure so that they can receive the Mail-Shake challenges. But receiving the challenges is only one step: the spammers would have to solve the CAPTCHA and resend the message. Until Mail-Shake is so widely adopted that it would be worthwhile to change the infrastructure, Mail-Shake will protect its users from junk mails.
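The three ways of recognizing a bounce discussed above (a compliant RFC 3464 report, the Exim header, the qmail subject) can be summarized in a small sketch. The Mail struct, the enum and the function name are hypothetical and are not part of the Mail-Shake or KMime API. The `marker` parameter stands for any string that is assumed to occur in every Mail-Shake notification, and requiring it for the qmail case as well is an assumption of this sketch. Inspecting the original mail attached to a compliant report requires MIME parsing and is left out.

```cpp
#include <map>
#include <string>

// Minimal stand-in for a parsed mail; header names are stored lowercase.
struct Mail {
    std::map<std::string, std::string> headers;
    std::string body;
};

enum class BounceKind { None, Rfc3464, EximStyle, QmailStyle };

// Classifies a mail as a delivery status notification using the
// heuristics from the text. "marker" is a hypothetical string expected
// in every Mail-Shake notification.
BounceKind classifyBounce(const Mail &mail, const std::string &marker)
{
    const auto header = [&mail](const std::string &name) {
        const auto it = mail.headers.find(name);
        return it == mail.headers.end() ? std::string() : it->second;
    };

    // Compliant DSNs announce themselves through the content type.
    if (header("content-type").find("multipart/report") != std::string::npos) {
        return BounceKind::Rfc3464;
    }

    // Non-compliant bounces paste the original mail into the body.
    const bool quotesNotification =
        mail.body.find(marker) != std::string::npos;

    if (mail.headers.count("x-failed-recipients") > 0 && quotesNotification) {
        return BounceKind::EximStyle;
    }
    if (header("subject") == "failure notice" && quotesNotification) {
        return BounceKind::QmailStyle;
    }
    return BounceKind::None;
}
```

Checking the body for the marker keeps a legitimate mail with the subject “failure notice” from being dropped, at the price of missing bounces that do not quote the original message.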
Another important question is the handling of mails received from automated systems. Therefore one purchase at a web shop was performed. The private address has to be used as nobody will answer the Mail-Shake challenges, so the expected mails have to be put on the whitelist manually. The mails following the registration would have been blocked, but the address was added to the whitelist. So the expectation was that further communication is possible. The email containing the bill, however, was sent from a completely different domain and was therefore not whitelisted. The assumption that it is sufficient to whitelist one domain in order to communicate with automated systems does not hold.

During the evaluation another false positive mail was received from a social network service³. An invitation to join this social network was sent to the private address, and as the sender address is not on the whitelist the mail was removed. In this case the shown notification did not help to manually whitelist the mail, as many junk mails are camouflaged as mails from this specific social network. Such a situation, where a mail is sent from a perhaps whitelisted contact via a not whitelisted third party service, cannot work with Mail-Shake. If the private address is specified, the mail is dropped; if the public address is specified, the third party service does not solve the challenge as it is an automated system. So it is impossible to send a mail to a Mail-Shake protected account via a third party service. This is a serious flaw in conjunction with all social networking services.

The most important question “Will Joe User be able to handle Mail-Shake requests?” can of course not be answered in the scope of this evaluation. To answer this question a usability study would have to be set up, which could be the topic of a thesis of its own. Even with a usability study it is difficult to answer this question. The users participating in the study would probably react in a different way to the received challenges. In two cases the receivers of a notification did solve the CAPTCHA and sent a mail to the public address to generate a unique id, but they did not resend a mail containing the id to the private address. One of these senders did know about the evaluation and about how Mail-Shake works.

5.3 Greylisting

The infrastructure used during the evaluation period was protected by a Greylisting service to limit the number of received junk mails. During the evaluation Greylisting was disabled for the participating email addresses. This has the advantage that all spam mails are accepted and passed to Mail-Shake. In addition the response time is improved, as the sending MTA does not have to wait till it is allowed to deliver the mails. The nice side effect is that this change in the infrastructure allows evaluating the usefulness of Greylisting as well.

³Facebook

Figure 5.1: Rejected mails in December 2009 on the evaluated MTA.

Figure 5.1 illustrates the number of rejected mails and mails recognized as spam by the Spam Assassin installation in December 2009, before the beginning of the evaluation. The number of rejected mails is not equal to the number of mails blocked by Greylisting. There are of course legitimate mails which are delayed by Greylisting, as well as mails which are rejected due to the fact that they try to use the MTA as an Open Relay. The graph illustrates another interesting fact: in calendar week 52 around Christmas the number of rejected and spam mails is significantly lower than in the weeks before and after. The system does not scan incoming mails for attached malware, which is the reason why the number of viruses is zero.

For the next month, January 2010, the expected result is a drop in the number of rejected mails while the number of spam mails should increase. But the Mail-Shake implementation is responsible for some of the rejected mails, as delivery status notifications are not parsed correctly by the underlying libraries. Because of that, Mail-Shake tried to send notifications to the mailer daemon of the MTA, which does not exist in the virtual alias table, causing the notifications to be rejected. In Figure 5.2 the statistics for January are illustrated. As expected the number of recognized spam mails is, with 386, significantly higher than in the previous month with only 148 recognized spam mails. This is an increase by a factor of 2.6. On the other hand the number of rejected mails dropped from 1023 to 426 in January, by a nearly equal factor of 2.4. Surprisingly the absolute number of rejected mails dropped by 597 while the number of spam mails increased only by 238.
This could be explained if the spam senders retried the delivery of greylisted mails. But in that case the number of spam mails would not have increased so much. It can only be explained if legitimate mails are rejected, too. But as mentioned above the number of private mails received is rather small. Most of the mails are sent by automated systems like mailing lists or bug trackers, and those mails should not be greylisted.

Figure 5.2: Rejected, bounced and junk mails in January 2010 on the evaluated MTA.

Address     Filtered Mails
privat        7
studium       0
internet    340
kde          34
ubuntu       35
Sum         416

Table 5.2: Number of mails filtered by Mail-Shake in January 2010 for the different addresses

5.4 Results from January

As the evaluation started in January and some problems were recognized during the evaluation for the first time, the results are not perfectly accurate. For example, due to a bug in the KMime implementation the Mail-Shake Akonadi agent was unable to parse delivery status notifications correctly, resulting in increased numbers of rejected mails (see Chapter 4.3.2.3 on page 71). The case of junk mails using the recipient’s address as the sender address was not yet considered either and caused some mail loops. Nevertheless the gathered results provide a nice summary of the way Mail-Shake handles mails received at the private address.

Altogether the Mail-Shake setup removed more than 400 mails during this month. The mail loops are of course not counted, as they could be considered a bug in the implementation. How many mails were received for each of the addresses is shown in Table 5.2. As expected the address used for web services received most mails. Surprisingly the address used for the studies has not leaked and no mail was received. Also the address only known to friends seems to be well protected, but there is a mistake in the Sieve script causing junk mails for that address to be moved to the spam folder instead of the folder used by Mail-Shake. The two mail addresses used for KDE development were used as recipient for junk mails more often, but not as often as would be expected, as those addresses are accessible in plain text on various web sites such as KDE’s websvn installation⁴, mailing lists⁵ and bug tracker⁶.

                                   privat  internet  kde  ubuntu
Filtered Mails                          7       340   34      35
DSN                                     2        87   10       8
DSN by own MTA                          2        74    8       6
Junk                                    5       253   24      27
Not recognized by Spam Assassin         2        11    5       1

Table 5.3: Statistics for filtered mail per address in January

Although there is a difference in the number of received mails per address, all mails are considered in the further discussion. Nevertheless the exact data per address is provided in Table 5.3. The number of filtered mails also includes the delivery status notifications. As mentioned this number is not perfectly accurate, as at the beginning the DSNs were not recognized and moved. Sometimes the Mail-Shake agent crashed (see Chapter 6.1 on page 99) before sending the notification. As it is obvious that such a mail would have been dropped, these mails are included in this statistic.

In several cases the MTA retried to send the Mail-Shake notification for several days, resulting in additional delayed mail notifications, which are in fact a variant of delivery status notifications. Altogether 107 delivery status notifications were received, which is approximately one fourth of all received mails. 90 (84 %) of those 107 delivery status notifications were issued by the own sending MTA. The difference of eight mails to the number of bounces in Figure 5.2 is due to uncaught mail loops at the beginning of the evaluation.

The fact that every fourth received mail is a delivery status notification is rather unpleasant for the user. If the user activates the notifications to be informed about mails which would be dropped, the user gets annoyed by those notifications. A false positive rate of 25 % in notifications is too high. To fix this issue the Mail-Shake API has to be changed to provide a reason why the mail is dropped, so that no notification is shown when a DSN is dropped.
Another possible solution is to not send Mail-Shake notifications for mails classified as junk by Spam Assassin.

Subtracting the delivery status notifications results in 309 junk mails filtered by Mail-Shake. This number is lower than the one shown in the graph above because junk mails were received for other addresses on that MTA, too. In particular, mails without a recipient or with an invalid recipient were not filtered into one of the mail folders used by Mail-Shake. The mistake in the Sieve script is a further cause for the discrepancy.

⁴http://websvn.kde.org
⁵E.g. KWin mailing list at @kde.org
⁶http://bugs.kde.org

Of course one of the reasons for the implementation of Mail-Shake is that it is able to remove junk mails which are not recognized as such by existing technologies. It is an interesting question how many of those 309 junk mails have not been classified as junk by the Spam Assassin installation. In this setup 19 mails filtered by Mail-Shake were not recognized by Spam Assassin. Another 32 unclassified mails which were not passed to Mail-Shake were marked manually as junk in January. Given these numbers, Mail-Shake is a real improvement as it reduces the need to check the mails manually. If we assume that manually classifying a mail as junk takes half a minute, Mail-Shake is able to reduce the workload for this mail account⁷ by about 25 minutes per month (51 mails at half a minute each).

On the other hand Mail-Shake requires the user to check for false positives. But if different technologies are combined, this step can be removed. For example, greylisting can be used to reduce the number of received mails, and notifications could be suppressed for mails which Spam Assassin classifies as junk and whose sender is not on the whitelist. In this way the number of shown notifications becomes acceptable and does not cause further workload. Even manually checking the mails filtered by Mail-Shake becomes acceptable if it does not include DSNs and mails already classified as junk by Spam Assassin.
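The combination described above (no notification for DSNs and none for mails already classified as junk by Spam Assassin) can be sketched as a small decision function. This is only an illustration in Python, not the C++ agent written for this thesis. The function names are hypothetical, the `X-Spam-Flag: YES` header is assumed to be the marker set by the Spam Assassin installation, and only RFC 3464 compliant DSNs are recognized, so the non-compliant notifications discussed in this chapter would still slip through.

```python
from email import message_from_string
from email.message import Message
from email.utils import parseaddr


def is_dsn(msg: Message) -> bool:
    """True if the mail declares itself a delivery status report (RFC 3464)."""
    report_type = msg.get_param("report-type")
    return (
        msg.get_content_type() == "multipart/report"
        and isinstance(report_type, str)
        and report_type.lower() == "delivery-status"
    )


def flagged_by_spamassassin(msg: Message) -> bool:
    """True if the mail carries the assumed marker header X-Spam-Flag: YES."""
    return msg.get("X-Spam-Flag", "").strip().upper() == "YES"


def should_notify(msg: Message, whitelist: set) -> bool:
    """Decide whether a 'mail will be dropped' notification is worth showing."""
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if is_dsn(msg):
        return False  # DSNs are dropped without bothering the user
    if flagged_by_spamassassin(msg) and sender not in whitelist:
        return False  # already classified as junk by another filter
    return True
```

With such a policy only mails that neither Spam Assassin nor the DSN check can explain would remain for a manual check.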

5.5 Results from February

The results for February cover the period from the 1st to the 15th of February. The change mentioned above, to not send notifications in response to DSNs, is included but unfortunately does not work properly. For spam mails the same problem occurs as for DSNs: KMime is unable to parse the mails correctly. It did, however, work correctly when the spam mail was not a plain text mail. Altogether 356 mails were filtered by Mail-Shake during these 15 days, which is a significant increase compared to January. The data for each address is presented in Table 5.4 and shows the same picture as in January: most of the mails are received for the address used on web sites, while the other addresses hardly receive any junk mails.

Although Mail-Shake was changed in a way to receive fewer DSNs, 119 DSNs were received. 96 (80 %) of those were issued by the own MTA. The rate is approximately the same as in January, so we can assume that most DSNs are issued by the own sending MTA. This is useful information for the case that the own MTA does not send RFC compliant delivery notifications: Mail-Shake could be changed so that the user specifies the address of the own MTA and Mail-Shake is trained to recognize exactly those mails. That could for example be based on the presence of exactly one “Received” header, which cannot be forged by spammers. Subtracting the DSNs, a total of 237 junk mails were filtered.

⁷In case all mails are passed to Mail-Shake

Address    Filtered Mails
privat                 38
studium                 0
internet              286
kde                    23
ubuntu                  9
Sum                   356

Table 5.4: Number of mails filtered by Mail-Shake in February 2010 for the different addresses

The exact numbers for each address are presented in Table 5.5. Among those 237 junk mails, there were 11 mails not classified by Spam Assassin; another 11 unclassified mails have been classified manually. One of those mails was in fact not junk but contained malware (confirmed with CWSandbox[70]). The mail was very well crafted and probably able to trick users into opening the attachment, which executes the malicious software. This is a very important result, as it proves that Mail-Shake is also able to remove mails containing malicious software. This can help the user in case he does not use anti-virus software or uses one with outdated signatures.
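The idea mentioned above, recognizing notifications of the own MTA by the presence of exactly one “Received” header, could look roughly like the following sketch. It is written in Python for brevity and is not part of the thesis implementation. The function name, the host name `mail.example.org` and the simple substring comparison are assumptions made for illustration only.

```python
from email import message_from_string
from email.message import Message


def looks_like_own_mta_dsn(msg: Message, own_mta_host: str) -> bool:
    """Heuristic: the mail passed exactly one hop, and that hop is the own MTA."""
    received = msg.get_all("Received") or []
    if len(received) != 1:
        # Mail relayed through other servers carries additional Received headers.
        return False
    # Crude check (assumption): the configured host name appears in the header.
    return own_mta_host.lower() in received[0].lower()
```

A real implementation would have to parse the header properly instead of searching for a substring, and the user would have to configure the host name of the own MTA as described in the text.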

                                   privat  internet  kde  ubuntu⁸
Filtered Mails                         38       286   23       9
DSN                                    20        90    9       0
DSN by own MTA                         12        77    7       0
Junk                                   18       196   14       9
Not recognized by Spam Assassin         0         9    0       2

Table 5.5: Statistics for filtered mail per address in February
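The totals and rates quoted in the text can be cross-checked against the per-address values of Tables 5.3 and 5.5. The following short Python calculation is only a worked check of that arithmetic. The dictionary layout and the function names are chosen for this illustration, and the “studium” address is left out because it received no mails.

```python
# Per-address counts copied from Tables 5.3 and 5.5
# (column order: privat, internet, kde, ubuntu).
JANUARY = {
    "filtered": [7, 340, 34, 35],
    "dsn": [2, 87, 10, 8],
    "dsn_own_mta": [2, 74, 8, 6],
    "junk": [5, 253, 24, 27],
    "missed_by_sa": [2, 11, 5, 1],
}
FEBRUARY = {
    "filtered": [38, 286, 23, 9],
    "dsn": [20, 90, 9, 0],
    "dsn_own_mta": [12, 77, 7, 0],
    "junk": [18, 196, 14, 9],
    "missed_by_sa": [0, 9, 0, 2],
}


def summarize(table):
    """Sum the columns and derive the two rates discussed in the text."""
    filtered = sum(table["filtered"])
    dsn = sum(table["dsn"])
    return {
        "filtered": filtered,
        "dsn": dsn,
        "junk": sum(table["junk"]),
        "missed_by_sa": sum(table["missed_by_sa"]),
        "dsn_share": dsn / filtered,
        "own_mta_share": sum(table["dsn_own_mta"]) / dsn,
    }


def rows_consistent(table):
    """Junk must equal filtered mails minus DSNs for every address."""
    return all(
        f - d == j
        for f, d, j in zip(table["filtered"], table["dsn"], table["junk"])
    )


for name, table in (("January", JANUARY), ("February", FEBRUARY)):
    s = summarize(table)
    print(
        f"{name}: {s['filtered']} filtered, {s['dsn']} DSNs "
        f"({s['dsn_share']:.1%} of all mails, "
        f"{s['own_mta_share']:.1%} from the own MTA), {s['junk']} junk"
    )
```

The sums reproduce the 107 and 119 DSNs and the 309 and 237 junk mails stated in the text.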

5.6 Summary

In this chapter the evaluation of Mail-Shake has been presented. As expected, Mail-Shake is able to reduce the number of spam messages delivered to the end-user and is also able to protect against malware distributed via email. Unfortunately the evaluation has revealed that many mail transfer agents are not capable of sending RFC compliant delivery status notifications. To handle these notifications correctly Mail-Shake has to be adjusted, and the user has to decide whether he wants to see all or none of the delivery status notifications sent from those mail servers.

The assumption that Mail-Shake cannot produce false negatives could be verified during the evaluation. Only if the user puts his own address on the whitelist might Mail-Shake produce false negatives, as spammers use the recipient address as sender address. Unfortunately some false positives were generated in conjunction with the communication of automated systems. The assumption that a whitelist entry can be constructed from one received mail for an automated system does not hold.

⁸The agent monitoring the “ubuntu” address crashed whenever a mail was received, and no notification was sent. These crashes only appeared for this one agent and seem to be caused by a misconfiguration.

6 Retrospection and Future Tasks

In this chapter some of the problems which arose while writing this thesis are named. Of course there were problems and not everything worked as expected. The software choice was not optimal in some minor points. Another topic of this chapter is the “Future Tasks”, which includes some ideas to improve the two implementations and topics for possible other theses based on the results of this one.

6.1 Problems caused by Akonadi

While Akonadi in general was the perfect choice for implementing the two applications, as it allows an email client independent solution, there are some disadvantages. Most of them are caused by the fact that Akonadi is still under heavy development and some functionality that Mail-Shake and the Spam Templates rely on had not yet seen a stable release at the time of development. Of course, before the decision to use KDE Akonadi was made, it was known that Akonadi is a rather new technology and that many parts are still work in progress. But it was also known that KDAB planned to provide a proof of concept port of KMail based on Akonadi at the end of the year 2009. With the release of the KDE Software Compilation version 4.5 in summer 2010, KMail will be available as a stable release based on Akonadi. So at that time all requirements for the agents will be fulfilled and working properly. The problems which occurred during development will not be a problem for productive usage.

A problem during the evaluation (see Chapter 5.2 on page 90) was the rather unfinished state of the Akonadi IMAP resource. This resource was written for the release of the KDE Software Compilation 4.4, so a development version was used. In general the resource works fine, but it had problems with the large number of 93 folders in various sub trees with several thousands of mails. The disk usage of the IMAP account is nearly 1 GB and the biggest single folder has a size of 184 MB. Most of the problems were in fact already solved at the time of the release of KDE SC 4.4 at the beginning of February 2010.

One of the first problems was that Akonadi did not notify Mail-Shake about new mails in the newly created sub folders for the public addresses. In this case the problem seemed to be that all folders were part of a sub tree. After recreating the folders as top-level items the problem no longer occurred.

Another problem occurred when one of the agents was about to move a mail while the same folder was opened in KMail, too. In that case the mail was only copied to the destination folder. Nevertheless Mail-Shake worked correctly: the status notifier was activated and notification emails were sent. This could be easily verified by the received Delivery Status Notifications. Mail-Shake is implemented to silently fail in such cases and not to retry or to report an error. Akonadi offers a debugging console which was used during the development. In the productive environment the debugging console was unfortunately hardly usable, so that there was no way to gather debug information. It would have been useful to have a complete log of the actions performed by the Mail-Shake agent. However, in a final state this is unneeded and the assumption that all errors can silently be ignored is still correct. The mentioned problem of KMail blocking the deletion of mails will of course be solved by the port to Akonadi, as KMail will then only be another view on the same mail data provided by Akonadi and will not use its own IMAP connection.
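The wish for a complete action log can be combined with the “fail silently” behavior described above. The following Python sketch shows one possible shape of such a wrapper. It is not how the agent is implemented; the logger name `mailshake.agent` and the function `run_silently` are hypothetical.

```python
import logging

# Hypothetical logger name; the real agent has no such facility.
log = logging.getLogger("mailshake.agent")


def run_silently(action: str, func, *args):
    """Run one agent step, record the outcome, and never raise to the caller."""
    try:
        result = func(*args)
    except Exception:
        # Keep the silent-failure behavior, but leave a trace including the traceback.
        log.exception("%s failed", action)
        return None
    log.info("%s succeeded", action)
    return result
```

In a productive setup the logger would be configured at INFO level and attached to a file handler, so that the sequence of moves, deletions and sent notifications can be reconstructed afterwards without changing what the user sees.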

Another problem in conjunction with KMail and Akonadi is that if Akonadi deletes a mail, it is kept in the same folder marked with the flag “\Deleted”, but the folder is not purged. The mail is not visible in KMail anymore but the number of unread mails is unchanged, so the number of unread mails is continuously increasing. This is in fact a problem occurring in KMail itself if a mail is moved to another folder and the folder is changed while moving. This quite annoying bug of KMail will probably be solved with the port to Akonadi, and KMail will also gain the feature to purge all deleted mails from a folder.

Sometimes aer resuming the system from suspend state, the IMAP resource was not working anymore. In su a situation it was required to toggle the online/offline state of the resource once. A similar problem occurs also with the KMail IMAP implementation. In fact the behavior is easy to understand as the connection to the server times out. An easy solution would be to automatically reconnect aer a suspend.

The KMime library is unable to parse Delivery Status Notifications properly. While Mail-Shake has its own implementation to circumvent this problem, it did not solve it completely. Several times the agent crashed when processing a DSN. Unfortunately the stack trace of the crash was hardly useful. The topmost calling methods are missing; nevertheless it shows that the crash occurs in the underlying libraries and not in the agent¹. For the process of Mail-Shake this crash is not a problem. The agent is automatically restarted and no notification would have been sent in response to the DSN anyway. So the only disadvantage of this crash is that the DSN is not moved or deleted. The crash did not occur for each DSN, though.

¹Because of the incompleteness of the stack trace the crash was not reported to the developers.

6.2 Future Tasks for Spam Templates

For the Spam Templates implementation to be fully functional, a server component is required. This component has to execute the spam bots in a sandbox in an automated way. The templates have to be generated and published on a public web server as an RSS feed. Without this public RSS feed the implemented agent cannot be used, as there is no way to update the list of used templates.

Due to this missing server side component it was difficult to test the implementation in a productive environment, and so it is possible that productive usage will reveal unknown problems which will require changing the code. For example, it might be required to publish lists of obsoleted Spam Templates in order to reduce the number of templates used to test incoming mails. It is still unknown whether the choice of RSS to publish new templates is the best one. This part can also only be evaluated after the setup of the server side component.

The KDE RSS feed reader Akregator is currently being ported to Akonadi, including a new library for handling RSS feeds. This will most likely deprecate the old and currently used library. By the time the server component could be available, it is likely that the new library is available, and the Akonadi Spam Template Agent should be ported to this library in order to have a more coherent source tree completely based on Akonadi.

Another topic for the future would be to provide extensions for various email clients (e.g. Mozilla Thunderbird) based on the small abstracted library. The advantage of such an extension is that it does not require a running Akonadi installation. While Akonadi is a runtime requirement for KDE Plasma, it is not (yet) for other free or proprietary workspaces. Of course such an extension is only useful if the server component exists. For this the library has to become a shared object (currently it is built as a static library) and has to provide a stable Application Binary Interface (ABI). This will of course require an API review to ensure that it will work with other applications as well.

The Spam Templates concept might also be a nice addition to existing rule-based spam filtering systems such as Spam Assassin. This would allow doing the initial check directly in the MTA, and the result could be used in conjunction with other rule-based results to identify the mail as junk more reliably. For example, if the IP address is on a current blacklist and the mail body matches a template, the mail is junk with very high probability and could be rejected directly. Of course integration into Spam Assassin on the server only allows an initial check and not the useful successive checks on the client. In such a setup the client implementation could be changed to not check incoming mails, as those have already been checked.
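How a template match could be combined with a blacklist lookup in the MTA is sketched below in Python. This is a simplified illustration and not the matching algorithm implemented in this thesis: a template is modeled here as a compiled regular expression, the blacklist as a plain set of IP addresses, and the intermediate verdict “mark-as-junk” for a single piece of evidence is an assumption of this sketch.

```python
import re


def matches_template(body: str, templates) -> bool:
    """True if the whole mail body is an instance of one of the templates."""
    return any(t.fullmatch(body.strip()) for t in templates)


def mta_verdict(body: str, sender_ip: str, templates, blacklist) -> str:
    """Combine both checks; only the combination leads to a direct rejection."""
    template_hit = matches_template(body, templates)
    listed = sender_ip in blacklist
    if template_hit and listed:
        return "reject"
    if template_hit or listed:
        return "mark-as-junk"  # assumption: one hit alone only marks the mail
    return "accept"


# Usage with made-up data (documentation IP ranges, invented template).
templates = [re.compile(r"Buy (cheap|best) pills at http://\S+ today!")]
blacklist = {"192.0.2.10"}
print(mta_verdict("Buy cheap pills at http://example.invalid/x today!",
                  "192.0.2.10", templates, blacklist))
```

In a Spam Assassin integration the two results would more likely contribute to the overall score instead of producing a verdict of their own.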

6.3 Future Tasks for Mail-Shake

As opposed to the Spam Templates implementation, the evaluation in Chapter 5 on page 89 illustrated that Mail-Shake is already fully functional and usable. Nevertheless Mail-Shake can still be improved and worked on.

6.3.1 Handling of Delivery Status Notifications

One of the most important tasks is to find a solution to the problem of Delivery Status Notifications, because they provide a way to attack Mail-Shake. Mail-Shake itself can hardly be improved on this topic, except by giving the user the choice either to drop all DSNs automatically, which also removes correct DSNs, or to use the current implementation, which allows spam disguised as a DSN to pass the Mail-Shake test.

This issue can only be solved in the email client implementations. The client knows the unique ID of all sent messages and can check the original mail attached to the DSN for this ID. If the ID matches one of the sent messages the DSN is valid. Otherwise the DSN can be discarded.

As attaching the original message is an optional feature, spammers could work around this protection by not including the original mail and placing their junk in the message directly. Such a DSN could be recognized neither by Mail-Shake nor by the email client. The solution to such a spam attack is the Spam Templates approach, but instead of testing whether the mail is junk, the template has to test whether the mail is a DSN. Therefore the templates must contain the content of DSNs sent by various MTAs. An infrastructure for updating the templates is required, as it is possible that new versions of MTAs change the layout of the DSN.
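The client-side check described above, comparing the ID of the attached original mail with the IDs of sent messages, can be sketched with the MIME parser of the Python standard library. This is an illustration only and not code from the thesis. The function names are hypothetical, and it is assumed that the original mail is attached either as a `message/rfc822` part or as a `text/rfc822-headers` part, as foreseen for RFC compliant reports.

```python
from email import message_from_string
from email.message import Message
from typing import Optional, Set


def original_message_id(dsn: Message) -> Optional[str]:
    """Return the Message-ID of the mail a DSN refers to, if it is attached."""
    for part in dsn.walk():
        ctype = part.get_content_type()
        if ctype == "message/rfc822":
            # The parser stores the attached mail as a one-element list.
            payload = part.get_payload()
            if isinstance(payload, list) and payload:
                return payload[0].get("Message-ID")
        elif ctype == "text/rfc822-headers":
            # Only the headers of the original mail were returned.
            headers = message_from_string(part.get_payload())
            return headers.get("Message-ID")
    return None


def dsn_is_valid(dsn: Message, sent_ids: Set[str]) -> bool:
    """A DSN is accepted only if it refers to a mail this client has sent."""
    message_id = original_message_id(dsn)
    return message_id is not None and message_id.strip() in sent_ids
```

A DSN without an attached original mail yields `None` and is therefore treated as invalid here, which corresponds to the case discussed above where only a template for the DSN layout of the MTA could help.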

6.3.2 Mail-Shake for Several Addresses

During the evaluation Mail-Shake has been used to protect several addresses. This requires running one agent for each address. This approach was chosen to simplify the already complex user interface and to keep the source code clean. A new mail may either be in a private, a public or the sent mails collection. This results in a clean and straightforward code path. If the agent handles several addresses, the code has to check to which of the addresses the mail belongs. In that case it is even possible that the same mail belongs to a public collection for one address and to a private collection for another address. Handling such situations could result in undefined behavior when the email is deleted due to being sent to the private address. Of course such a misconfiguration can also happen with several agents, but the execution can never reach an undefined state as the mails for different addresses are processed in parallel and not sequentially.

The disadvantage of the approach of having one process per monitored address is illustrated in Figure 6.1. Each agent has its own icon resident in the systray of the desktop shell. While those icons are hidden by default and only visible if a mail to the private address will be dropped, it is difficult to find the correct item for configuration. Only the tooltip provides a way to distinguish the different agents. Those items should be merged into one item allowing access to the configuration of all agents in a useful way. Unfortunately it is not possible to access the same notification item from various processes. A possible solution is to provide a helper application which is resident in the systray and to have the agents register themselves with this helper item via an interprocess communication protocol such as D-Bus.

Figure 6.1: Mail-Shake Agents in the systray

6.3.3 Solving Mail-Shake Challenges in Email Clients

Users should not cli on unknown hyperlinks in emails. Because of that including the CAPTCHA links in Mail-Shake allenges is not the best solution. It would be beer to integrate solving the allenge in the email client as it has been done for Mailody in this thesis. Su an integration is required for more email clients to secure the process of Mail-Shake and not to open the door for aaers. For the upcoming version 2 of KMail this task is rather trivial as the code wrien for Mailody can be reused. is task is more difficult for clients su as Mozilla underbird or Microso Outlook. Providing an extension is not sufficient as this would only add the functionality to those using the extensions and by that are already aware of Mail-Shake. In the case of Mozilla underbird and other open source clients the code can be wrien and proposed as a pat for a new release. But it is likely that the ange would be rejected as Mail- Shake is unknown and the developers do not want to maintain code for a very small user base. is is a ien and egg situation: for a high adoption of Mail-Shake there needs to be client integration – only an oen used concept will be integrated into the clients. In case of proprietary clients su as Microso Outlook or Apple Mail there is no ance to provide pates and by that the integration for solving Mail-Shake allenges can only be provided by the vendors.

6.3.4 Integrating Mail-Shake Directly Into Email Clients

Although the Mail-Shake Akonadi Agent can be used in conjunction with email clients not supporting Akonadi, it is possible and useful to provide an extension for other clients. This is a similar task to the one discussed for Spam Templates: there is the client independent library, and based on this library Mail-Shake could be integrated into different clients. The advantage for the user is that he does not have to use an Akonadi setup.

As the library is very lightweight and does not include a MIME parser, most of the “work” has to be done in the client: translating the mails into the format required by Mail-Shake, passing mails to the implementation and sending out the notifications. Managing the whitelist and ID storage is also part of the client implementation. As the Akonadi agent uses a SQLite database, this might be moved into the library, but that would remove the choice of the storage format from the client. If a client is already using a database it might be better to reuse it instead of adding a dependency on SQLite. And in fact, compared to the complexity of the user interface or of passing the mails to Mail-Shake, handling the storage is a minor task.

The implementation of the Akonadi agent illustrated that the most difficult task for the client implementation is the user interface. It has to be easy to use and should provide features like generating the CAPTCHA URL, managing the whitelist, adding addresses based on contacts etc. Approximately 60 % of the C++ code written for the client implementation is in fact for the configuration interface, and this does not include the actual interfaces, which are written in a markup language. For each email client a new configuration interface has to be written, and the chances for code sharing are rather limited as the clients use different toolkits and the configuration interface written in the scope of this thesis relies on Akonadi.
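The trade-off regarding the storage discussed above can be illustrated with a small interface sketch. It is written in Python and does not mirror the API of the C++ library. The class names, the single-column table layout and the function `sender_is_known` are assumptions for this illustration: the library side only depends on an abstract store, and each client decides whether SQLite or its own database sits behind it.

```python
import sqlite3
from abc import ABC, abstractmethod


class WhitelistStore(ABC):
    """Storage interface the client has to provide to the library."""

    @abstractmethod
    def add(self, address: str) -> None: ...

    @abstractmethod
    def contains(self, address: str) -> bool: ...


class SqliteWhitelistStore(WhitelistStore):
    """Variant comparable to the Akonadi agent: whitelist kept in SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS whitelist (address TEXT PRIMARY KEY)"
        )

    def add(self, address: str) -> None:
        with self._db:  # commits on success
            self._db.execute(
                "INSERT OR IGNORE INTO whitelist VALUES (?)", (address.lower(),)
            )

    def contains(self, address: str) -> bool:
        row = self._db.execute(
            "SELECT 1 FROM whitelist WHERE address = ?", (address.lower(),)
        ).fetchone()
        return row is not None


class InMemoryWhitelistStore(WhitelistStore):
    """Stand-in for a client that reuses its own storage instead of SQLite."""

    def __init__(self) -> None:
        self._addresses = set()

    def add(self, address: str) -> None:
        self._addresses.add(address.lower())

    def contains(self, address: str) -> bool:
        return address.lower() in self._addresses


def sender_is_known(sender: str, store: WhitelistStore) -> bool:
    """Library-side check that works with any store implementation."""
    return store.contains(sender)
```

Moving the SQLite variant into the library would save every client this small amount of code, but it would also force the SQLite dependency on clients that already have a database of their own.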

6.4 CAPTCHA Security

Verifying the security of CAPTCHAs based on the application shown in Chapter 3.2 is another possible future task. The application is currently only able to test CAPTCHAs which use a noisy background. The idea is to provide a testing suite for various common mistakes in the design of CAPTCHA systems. With minimal adjustments to the OpenGL shader an adapted application is already able to solve various different CAPTCHAs with the same insecure approach of adding a noisy background.

By providing a utility to verify that a CAPTCHA is insecure, there is the chance that more secure systems like reCAPTCHA are preferred over self-made implementations. On the other hand such a tool might result in the false assumption that a CAPTCHA is secure because the tool is unable to solve it. In addition, such a generic application might be used for getting illegal access to protected content, which is of course not the aim of such a tool.

Furthermore, if reCAPTCHA becomes the only used CAPTCHA system, it becomes more attractive for attacks. If the system is broken, many websites are broken, including the Mail-Shake challenge. Therefore it would be useful to have an alternative secure web based CAPTCHA system. Developing such a system for the usage in Mail-Shake might be an interesting task.

7 Conclusion

Drawing a conclusion for the results of this thesis is not an easy task. While the implementation of Mail-Shake is functional and helps to reduce the number of received junk mails as expected, there are some previously unknown problems that might render Mail-Shake unusable. The most prominent problem is the handling of delivery status notifications. Mail-Shake is capable of handling those correctly if the delivery status notification complies with the RFC. Unfortunately the evaluation showed that there are Mail Transfer Agents which do not comply, including the one used by Google Mail. The evaluation showed as well that most of the delivery status notifications are sent by the outgoing mail server. In combination that means that users of Google Mail are unable to use Mail-Shake. It is a disappointment to see that the implementation fails not due to a misconception in the design or bugs in the implementation but due to faults in external components.

Another problem which might hinder the adoption of Mail-Shake is the fact that the false positive rate is higher than expected. The assumption that a user knows that he will receive mails from an automated system does unfortunately not hold. This renders Mail-Shake unusable as the only spam fighting technique in use, because the user has to check the filtered mails for false positives, an unbearable task in case there are many spam mails plus the additionally generated delivery status notifications.

Nevertheless Mail-Shake is usable and the failure of various MTAs should not be a reason to not publish the source code. A broader adoption of Mail-Shake might catalyze the development of RFC compliant delivery status notifications, which would benefit further software projects which have the task of identifying bounce mails reliably. Until then the number of received notifications can be reduced by e.g. removing identified junk mails before Mail-Shake processes them.

Drawing a conclusion for the Spam Templates implementation is difficult as well. Due to the fact that there are currently no recent templates the implementation could not be tested properly. Nevertheless the tests so far show that it works. As the templates are generated from intercepted spam mails, it is impossible that this approach generates false positives, which is an advantage compared to the Mail-Shake approach. Unfortunately, the concept is unusable without the generated templates. For a broader adoption of this concept the server component has to be established first.

Altogether the tasks of this thesis are fulfilled. Both approaches are implemented and work properly. The implementation of Mail-Shake could even be evaluated through usage in a productive environment during the last two months. The implementation is in a state that the source code can be published and has been written with the needs of other projects in mind, so that they can easily provide custom implementations of Mail-Shake without reimplementing the whole concept.

Although Mail-Shake is working properly there are some issues with the underlying architecture which make the processing of delivery status notifications difficult. Therefore, publishing a first stable version should be postponed until the KDE development platform is available in version 4.5 in summer 2010 and the required adjustments to the improved API are implemented.

Personally, I will not continue the usage of Mail-Shake until the release of KDE SC 4.5 shipping the version of KMail ported to Akonadi. The fact that two email clients are connected to the IMAP server has a footprint on memory usage and there are some minor annoyances like the incorrect number of new mails in KMail. These problems will be solved when KMail is using Akonadi, and from that point of view there is then no reason to not use Mail-Shake. Of course there will be a different setup. For example the email account used for KDE development cannot be protected by Mail-Shake due to the fact that mails sent by Review Board cannot be whitelisted. Also the setup will remove mails identified by Spam Assassin, and the whitelist will be filled beforehand so that senders do not have to authenticate themselves.

Bibliography

[1] L. von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum. “reCAPTCHA: Human-Based Character Recognition via Web Security Measures.” In: Science 321.5895 (2008), p. 1465.
[2] Luis von Ahn and Will Cathcart. “Teaching computers to read: Google acquires reCAPTCHA.” In: The Official Google Blog (Sept. 2009). URL: http://googleblog.blogspot.com/2009/09/teaching-computers-to-read-google.html.
[3] C. Almer. Spam-Der endlose Kampf gegen unerwünschte E-Mails und die daraus entstehenden Kosten und Auswirkungen auf Wirtschaft und Gesellschaft. GRIN Verlag, 2008.
[4] J.P. Bigham and A.C. Cavender. “Evaluating existing audio CAPTCHAs and an interface optimized for non-visual use.” In: Proceedings of the 27th international conference on Human factors in computing systems. ACM New York, NY, USA. 2009, pp. 1829–1838.
[5] Jasmin Blanchette and Mark Summerfield. C++ GUI Programming with Qt4 (Prentice Hall Open Source Software Development). 2nd Revised edition (REV). Prentice Hall International, 2008. ISBN: 0132354160. URL: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0132354160.
[6] M. Chew and J.D. Tygar. “Image recognition captchas.” In: Lecture notes in computer science (2004), pp. 268–279.
[7] CodeSourcery, Compaq, EDG, HP, IBM, Intel, Red Hat, and SGI. C++ ABI for Itanium (Revision: 1.86). [Online accessed 22nd January 2010]. URL: http://www.codesourcery.com/public/cxx-abi/abi.html.
[8] Commtouch. 2006 Spam Trends Report: Year of the Zombies. 2006. URL: http://www.commtouch.com/downloads/Commtouch_2006_Spam_Trends_Year_of_the_Zombies.pdf.
[9] Commtouch. Q2 2008 Internet Threats Trend Report. [Online accessed 26th October 2009]. 2008. URL: http://www.pandasecurity.com/emailhtml/oxygen/Q2_08Email_Threats-Panda.pdf.
[10] Commtouch. Q2 2009 Internet Threats Trend Report. [Online accessed 26th October 2009]. 2009. URL: http://www.commtouch.com/download/1491.
[11] Nokia Corporation. Why Doesn’t Qt Use Templates for Signals and Slots? [Online accessed 9th February 2010]. 2009. URL: http://qt.nokia.com/doc/4.6/templates.html.
[12] James Coplien. “C++ idioms.” In: Proceedings of the Third European Conference on Pattern Languages of Programming and Computing. Citeseer. 1998.
[13] Microsoft Corporation. Thread Safety in the Standard C++ Library. [Online accessed 8th February 2010]. 2009. URL: http://msdn.microsoft.com/en-us/library/c9ceah3b.aspx.

[14] M. Crispin. “Internet Message Access Protocol - Version 4rev1.” In: RFC Editor United States (2003).
[15] Frédéric Dahl. “Der Storm-Worm.” Diplomarbeit. Universität Mannheim, 2008.
[16] Dancho Danchev. Inside India’s CAPTCHA solving economy. 2008. URL: http://blogs.zdnet.com/security/?p=1835.
[17] J. De Bruijn, R. Lara, A. Polleres, and D. Fensel. “OWL DL vs. OWL Flight: Conceptual modeling and reasoning for the semantic web.” In: Proceedings of the 14th international conference on World Wide Web. ACM. 2005, p. 632.
[18] J. Elson, J.R. Douceur, J. Howell, and J. Saul. “Asirra: a CAPTCHA that exploits interest-aligned manual image categorization.” In: CCS ’07: Proceedings of the 14th ACM conference on Computer and communications security. New York, NY, USA: ACM, 2007, pp. 366–374. ISBN: 978-1-59593-703-2. URL: http://doi.acm.org/10.1145/1315245.1315291.
[19] M. Engelberth, J. Göbel, C. Gorecki, and P. Trinius. “Mail-Shake.” In: DEXA’09. 20th International Workshop on Database and Expert Systems Application. 2009, pp. 43–47. URL: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5337521.
[20] Matthias Ettrich, Thiago Macieira, and Joseph Gaffney. Binary Compatibility Issues With C++. [Online accessed 22nd January 2010]. URL: http://techbase.kde.org/Policies/Binary_Compatibility_Issues_With_C++.
[21] Chaos Computer Club e.V. SCHNUCKI project analysis. [Online accessed 7th October 2009]. 2007. URL: http://www.0x11.net/schnucki/.
[22] EximWiki. Exim Frequently Asked Questions. [Online accessed 8th January 2010]. URL: http://wiki.exim.org/FAQ/Delivery/Q0607.
[23] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. illustrated edition. Addison-Wesley Professional, 1994. ISBN: 0201633612. URL: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0201633612.
[24] P. Golle. “Machine learning attacks against the Asirra CAPTCHA.” In: Proceedings of the 15th ACM conference on Computer and communications security. ACM New York, NY, USA. 2008, pp. 535–542.
[25] Jan Göbel, Thorsten Holz, and Philipp Trinius. “Towards Proactive Spam Filtering (Extended Abstract).” In: Proceedings of the 6th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer. 2009, p. 47.
[26] Evan Harris. The next step in the spam control war: Greylisting. 2003. URL: http://projects.puremagic.com/greylisting/whitepaper.html.
[27] Frank Mori Hess and Douglas Gregor. Boost.Signals2. [Online accessed 8th February 2010]. 2009. URL: http://www.boost.org/doc/libs/1_39_0/doc/html/signals2.html.
[28] J. Holman, J. Lazar, J.H. Feng, and J. D’Arcy. “Developing usable CAPTCHAs for blind users.” In: Proceedings of the 9th international ACM SIGACCESS conference on Computers and accessibility. ACM New York, NY, USA. 2007, pp. 245–246.

[29] Per Jessen. FH_DATE_PAST_20XX scores on all mails dated 2010 or later. Apache SpamAssassin bug system. 2010. URL: https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6269.
[30] Jaeyeon Jung and Emil Sit. “An empirical study of spam traffic and the use of DNS black lists.” In: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement. ACM New York, NY, USA. 2004, pp. 370–375. URL: http://portal.acm.org/citation.cfm?id=1028838.
[31] Stephen Kelly. Annual Osnabrück PIM Meeting Brings Exciting Announcements and Ambitious Plans. [Online accessed 13th February 2010]. 2010. URL: http://kdenews.org/2010/01/14/annual-osnabrck-pim-meeting-brings-exciting-announcements-and-ambitious-plans.
[32] Alfons Kemper and Andre Eickler. Datenbanksysteme. 6., aktualis. u. erw. A. Oldenbourg Wissensch.Vlg, 2006. ISBN: 3486576909. URL: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/3486576909.
[33] J. Kim, K. Chung, and K. Choi. “Spam filtering with dynamically updated URL statistics.” In: IEEE Security & Privacy (2007), pp. 33–39.
[34] C. Kreibich, C. Kanich, K. Levchenko, B. Enright, G.M. Voelker, V. Paxson, and S. Savage. “On the spam campaign trail.” In: First USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET’08). 2008. URL: http://www.usenix.org/event/leet08/tech/full_papers/kreibich/kreibich_html/.
[35] Tobias König and Robert Zwerus. Akonadi design. [Online accessed 13th February 2010]. 2007. URL: http://api.kde.org/4.3-api/kdepim-apidocs/akonadi/html/akonadi_design.html.
[36] Sebastian Kügler. Selkie - Standalone . [Online accessed 4th February 2010]. 2009. URL: http://techbase.kde.org/index.php?title=Projects/Silk/Selkie&oldid=44551.
[37] John R. Levine. “Experiences with Greylisting.” In: Conference on Email and Anti-Spam. Citeseer. 2005. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.2704&rep=rep1&type=pdf.
[38] P. Lupkowski and M. Urbanski. “SemCAPTCHA—user-friendly alternative for OCR-based CAPTCHA systems.” In: Computer Science and Information Technology, 2008. IMCSIT 2008. International Multiconference on (2008), pp. 325–329.
[39] Thiago Macieira. Binary Compatibility Examples. [Online accessed 22nd January 2010]. 2009. URL: http://techbase.kde.org/index.php?title=Policies/Binary_Compatibility_Examples&oldid=44271.
[40] Marco Martin. Status Notifier Specification. [Online accessed 10th February 2010]. 2009. URL: http://www.notmart.org/misc/statusnotifieritem/index.html.
[41] M. Matsumoto and T. Nishimura. “Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator.” In: ACM Transactions on Modeling and Computer Simulation (TOMACS) 8.1 (1998), pp. 3–30.

[42] McAffee and ICF. e Carbon Footprint of Email Spam Report. [Online accessed 26th October 2009]. 2009. : http://resources.mcafee.com/content/NACarbonFootprintSpam. [43] omas McGuire. Add support for encapsulated messages to KMime. [Online accessed 9th February 2010]. 2010. : http://reviewboard.kde.org/r/2858/. [44] K. Moore and G. Vaudreuil. RFC-3464: An Extensible Message Format for Delivery Status No- tifications. 2003. [45] G. Mori and J. Malik. “Recognizing objects in adversarial cluer: Breaking a visual CAPTCHA.” In: 2003 IEEE Computer Society Conference on Computer Vision and Paern Recognition, 2003. Proceedings. Vol. 1. 2003. [46] J. Myers and M. Rose. “Post Office Protocol - Version 3.” In: RFC Editor United States (1996). [47] nic.at. Spamhaus.org anges nic.at listing. 2007. : http://www.nic.at/en/uebernic/ current_issues/nicat_news/news_view/period/1180648800/2591999/archived/ article/81/spamhausorg-aendert-nicat-listing. [48] OECD. Baground Paper For e OECD Workshop on Spam. [Online accessed 26th October 2009]. 2004. : http://www.olis.oecd.org/olis/2003doc.nsf/LinkTo/dsti- iccp(2003)10-final. [49] e Honeynet Project. Know Your Enemy Lite: Proxy reats - Sos v666. 2008. : http: //honeynet.org/papers/proxy. [50] e Honeynet Project. Know your Enemy: Traing Botnets. 2005. : http://honeynet. org/papers/bots. [51] e Spamhaus Project. Report on the criminal ’Ro Phish’ domains registered at Nic.at. 2007. : http://www.spamhaus.org/organization/statement.lasso?ref=7. [52] Anirudh Ramaandran, David Dagon, and Ni Feamster. “Can DNS-based blalists keep up with bots.” In: Conference on Email and Anti-Spam. Citeseer. 2006. : http://citeseerx. ist.psu.edu/viewdoc/download?doi=10.1.1.123.2270&rep=rep1&type=pdf. [53] P. Resni. “RFC2822: Internet message format.” In: RFC Editor United States (2001). [54] Rene Rivera. Shrink Wrapped Boost. [Online accessed 6th November 2009]. 2007. 
: http: //www.boost.org/users/uses_shrink.html. [55] M. Sahami, S. Dumais, D. Heerman, and E. Horvitz. “A Bayesian approa to filtering junk e-mail.”In: Learning for Text Categorization: Papers from the 1998 workshop. Vol. 62. Madison, Wisconsin: AAAI Tenical Report WS-98-05. 1998, pp. 98–05. [56] A. Slaikjer. A Dual-Use Spee CAPTCHA: Aiding Visually Impaired Web Users while Pro- viding Transcriptions of Audio Streams. Te. rep. Tenical Report CMU-LTI-07, 2007. [57] P. Shved and D. Silakov. “Binary Compatibility of Shared Libraries Implemented in C++ on GNU/Linux Systems.” In: SYRCoSE 2009 (), p. 17. [58] X. Spammer, S. Sjouwerman, and J. Posluns. Inside the spam cartel. Syngress Publishing, 2004. : 1932266860. [59] J. Stewart. “Top Spam Botnets Exposed.” In: April 8 (2008). [Online accessed 14th February 2010], p. 2008. : http://secureworks.com/research/threats/topbotnets/. Bibliography 111

[60] Ben Sto, Jan Göbel, Markus Engelberth, Felix C. Freiling, and orsten Holz. “Walowdac – Analysis of a Peer-to-Peer Botnet.” In: European Conference on Computer Network De- fense (EC2ND). 2009. : http://pi1.informatik.uni- mannheim.de/filepool/ publications/waledac-paper.pdf. [61] Bjarne Stroustrup. e C++ Programming Language. Special Edition. Addison-Wesley Long- man, Amsterdam, 2000. : 0201700735. [62] J. Tam, J. Simsa, D. Huggins-Daines, L. von Ahn, and M. Blum. “Improving audio captas.” In: Proc. of the 4th Symp. on Usability, Privacy and Security (SOUPS’08), Pisburgh, PA, USA. 2008. [63] Andrew S. Tanenbaum. Modern Operating Systems (2nd Edition) (GOAL Series). 2nd ed. Prentice Hall, 2001. : 0130313580. : http://www.amazon.com/exec/obidos/ redirect?tag=citeulike07-20&path=ASIN/0130313580. [64] Exim Bug Traer. Bug 133: MIME-format bounce messages. [Online accessed 8th January 2010]. 1999. : http://bugs.exim.org/show_bug.cgi?id=133. [65] Carnegie Mellon University. e Official CAPTCHA Site. [Online accessed 7th October 2009]. 2009. : http://captcha.net/. [66] L. Von Ahn, M. Blum, and J. Langford. “Telling humans and computers apart automatically.” In: Communications of the ACM 47.2 (2004), pp. 56–60. [67] Z. Wang, W. Josephson, Q. Lv, M. Charikar, and K. Li. “Filtering image spam with near- duplicate detection.” In: Proceedings of CEAS. Citeseer. 2007. [68] S. Weibel, J. Kunze, C. Lagoze, and M. Wolf. “RFC2413: Dublin Core Metadata for Resource Discovery.” In: RFC Editor United States (1998). [69] Jonathan Wilkins. Strong CAPTCHA Guidelines. [Online accessed 14th December 2009]. Dec. 2009. : http://bitland.net/captcha.pdf. [70] Carsten Willems, orsten Holz, and Felix C. Freiling. “CWSandbox: Towards Automated Dynamic Binary Analysis.” In: IEEE Security and Privacy 5.2 (2007). [71] Robert Zwerus. “Storing Personal Information Management data.” MA thesis. University of Twente, 2007. 
: http://eprints.eemcs.utwente.nl/11421/01/zwerus.pdf.

113

A Examples of Delivery Status Notifications

A.1 RFC Compliant

Received: by mail.martin-graesslin.com (Postfix)
    id 9D57770B002C; Fri, 12 Feb 2010 11:18:06 +0100 (CET)
Date: Fri, 12 Feb 2010 11:18:06 +0100 (CET)
From: [email protected] (Mail Delivery System)
Subject: Undelivered Mail Returned to Sender
To: [email protected]
Auto-Submitted: auto-replied
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status;
    boundary="309B770B0006.1265969886/mail.martin-graesslin.com"
Message-Id: <[email protected]>
X-Length: 3914
X-UID: 619

This is a MIME-encapsulated message.

--309B770B0006.1265969886/mail.martin-graesslin.com
Content-Description: Notification
Content-Type: text/plain; charset=us-ascii

This is the mail system at host mail.martin-graesslin.com.

I’m sorry to have to inform you that your message could not be delivered to one or more recipients. It’s attached below.

For further assistance, please send mail to postmaster.

If you do so, please include this problem report. You can delete your own text from the attached returned message.

The mail system

<9-alp@dwl??????.com>: host mail.dwl??????.com[68.110.???.???] said: 550 5.1.1 User unknown (in reply to RCPT TO command)

--309B770B0006.1265969886/mail.martin-graesslin.com
Content-Description: Delivery report
Content-Type: message/delivery-status

Reporting-MTA: dns; mail.martin-graesslin.com
X-Postfix-Queue-ID: 309B770B0006
X-Postfix-Sender: rfc822; [email protected]
Arrival-Date: Fri, 12 Feb 2010 11:18:05 +0100 (CET)

Final-Recipient: rfc822; 9-alp@dwl??????.com
Original-Recipient: rfc822;9-alp@dwl??????.com
Action: failed
Status: 5.1.1
Remote-MTA: dns; mail.dwl??????.com
Diagnostic-Code: smtp; 550 5.1.1 User unknown

--309B770B0006.1265969886/mail.martin-graesslin.com
Content-Description: Undelivered Message
Content-Type: message/rfc822

Received: from martin-apple.localnet (dslb-092-075-???-???.pools.arcor-ip.net [92.75.???.???])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mail.martin-graesslin.com (Postfix) with ESMTPSA id 309B770B0006
    for <9-alp@dwl?????.com>; Fri, 12 Feb 2010 11:18:05 +0100 (CET)
Content-Type: text/plain
To: 9-alp@dwl??????.com
From: Martin =?UTF-8?B?R3LDpMOfbGlu?=
In-Reply-To:
User-Agent: MailShake Akonadi Agent v0.0.1
Date: Fri, 12 Feb 2010 10:18:05 +0000
Message-Id: <[email protected]>
MIME-Version: 1.0
Subject: Mail-Shake protected email address
X-Mailshake-URL: http://mailhide.recaptcha.net/d?k=01IHRKvW1O6U1ki7iUtOjLdw==
    &c=f_SZ08wEIDWSkMzBdhPxFmf_GNWNGQWs1CKQzFyDAKMdHkyTAMX7i_y9mho2dn97

You sent an email to a private Mail-Shake address but your address is not whitelisted. The email will not be delivered. You have to send an email to the public address. You can retrieve the public address by visiting the following web address and solving the shown CAPTCHA: http://mailhide.recaptcha.net/d?k=01IHRKvW1O6U1ki7iUtOjLdw==&c=f_SZ08wEIDWSkMzBdhPxFmf_ GNWNGQWs1CKQzFyDAKMdHkyTAMX7i_y9mho2dn97

After sending an email to the public address you will receive a challenge email. This challenge contains an unique identifier. Please include this identifier in the subject of your original email and resent this email unchanged. It must be addressed to this private address.

There is no need to solve the CAPTCHA as stated in the email you will receive.

This message was generated automatically. Please do not reply.

--309B770B0006.1265969886/mail.martin-graesslin.com--

A.2 Exim

Received: by mail.martin-graesslin.com (Postfix, from userid 114)
    id D5BFFB9CAB1C; Fri, 8 Jan 2010 14:53:37 +0100 (CET)
X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on v32201.1blu.de
X-Spam-Level:
X-Spam-Status: No, score=-2.6 required=5.0 tests=ADVANCE_FEE_1,BAYES_00
    autolearn=ham version=3.1.7-deb
Received: from serv01.s?????.com (serv01.s?????.com [67.15.???.???])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mail.martin-graesslin.com (Postfix) with ESMTPS id EFABAB9CAAC8
    for ; Fri, 8 Jan 2010 14:53:33 +0100 (CET)
Received: from mailnull by serv01.s?????.com with local (Exim 4.69)
    id 1NTFHQ-0006lW-LE
    for [email protected]; Fri, 08 Jan 2010 07:53:32 -0600
X-Failed-Recipients: [email protected]?????.com
Auto-Submitted: auto-replied
From: Mail Delivery System
To: [email protected]
Subject: Mail delivery failed: returning message to sender
Message-Id:
Date: Fri, 08 Jan 2010 07:53:32 -0600
X-Length: 3876
X-UID: 11

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its recipients. This is a permanent error. The following address(es) failed:

[email protected]?????.com (ultimately generated from [email protected]?????.com) retry timeout exceeded

------This is a copy of the message, including all the headers. ------

Return-path:
Received: from [88.84.154.41] (port=35794 helo=mail.martin-graesslin.com)
    by serv01.s?????.com with esmtps (TLSv1:AES256-SHA:256) (Exim 4.69)
    (envelope-from )
    id 1NTFHL-0006ge-JQ
    for [email protected]?????.com; Fri, 08 Jan 2010 07:53:32 -0600
Received: from martin-apple.localnet (dslb-094-217-???-???.pools.arcor-ip.net [94.217.???.???])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mail.martin-graesslin.com (Postfix) with ESMTPSA id 7ED0CB9CAAC8
    for ; Fri, 8 Jan 2010 14:53:23 +0100 (CET)
Content-Type: text/plain
To: [email protected]?????.com
From: Martin =?UTF-8?B?R3LDpMOfbGlu?=
In-Reply-To:
User-Agent: MailShake Akonadi Agent v0.0.1
Date: Fri, 08 Jan 2010 13:56:12 +0000
Message-Id: <[email protected]>
MIME-Version: 1.0
Subject: Mail-Shake protected email address
X-Mailshake-URL: http://mailhide.recaptcha.net/d?k=01hIwl2plCRiJx1DVTkL68IA==
    &c=H0dfFAXH7TjgNAQa1i_PI87adcV4G4f0lRdOkCM-X1oBSGblAWQWyUP6cu_XRRaI

You sent an email to a private Mail-Shake address but your address is not whitelisted. The email will not be delivered. You have to send an email to the public address. You can retrieve the public address by visiting the following web address and solving the shown CAPTCHA: http://mailhide.recaptcha.net/d?k=01hIwl2plCRiJx1DVTkL68IA== &c=H0dfFAXH7TjgNAQa1i_PI87adcV4G4f0lRdOkCM-X1oBSGblAWQWyUP6cu_XRRaI

After sending an email to the public address you will receive a challenge email. This challenge contains an unique identifier. Please include this identifier in the subject of your original email and resent this email unchanged. It must be addressed to this private address.

There is no need to solve the CAPTCHA as stated in the email you will receive.

This message was generated automatically. Please do not reply.

A.3 QMail

A.3.1 MIME Mail

Received: by mail.martin-graesslin.com (Postfix, from userid 114)
    id 83B3D70B0030; Sun, 31 Jan 2010 23:39:59 +0100 (CET)
Received: from mail.w?????.net (mx3.w?????.net [66.232.???.???])
    by mail.martin-graesslin.com (Postfix) with ESMTP id B556D70B002A
    for ; Sun, 31 Jan 2010 23:39:58 +0100 (CET)
Received: (qmail 31369 invoked for bounce); 31 Jan 2010 22:39:57 -0000
Date: 31 Jan 2010 22:39:57 -0000
From: MAILER-DAEMON@w?????.net
To: [email protected]
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="1264977597w?????.net18787"
Subject: failure notice
Message-Id: <[email protected]>
X-Length: 3269
X-UID: 53

--1264977597w?????.net18787

Hi. This is the qmail-send program at whsecure.net. I’m afraid I wasn’t able to deliver your message to the following addresses. This is a permanent error; I’ve given up. Sorry it didn’t work out.

: user is over quota

--- Enclosed are the original headers of the message.

--1264977597w?????.net18787
Content-Type: message/rfc822

Return-Path:
Received: (qmail 31366 invoked by uid 399); 31 Jan 2010 22:39:57 -0000
Delivered-To: pewit@w?????.com
Received: (qmail 31356 invoked by uid 399); 31 Jan 2010 22:39:57 -0000
X-Spam-Checker-Version: SpamAssassin 3.2.3 (2007-08-08) on mail.w?????.net
X-Spam-Level:
X-Spam-Status: No, score=0.1 required=5.0 tests=RDNS_NONE autolearn=disabled
    version=3.2.3
X-Virus-Scan: Scanned by ClamAV 0.91.2 (no viruses); Sun, 31 Jan 2010 17:39:57 -0500
Received: from unknown (HELO mail.martin-graesslin.com) (88.84.154.41)
    by mail.w?????.net with ESMTP; 31 Jan 2010 22:39:57 -0000
X-Originating-IP: 88.84.154.41
Received-SPF: none (mail.w?????.net: domain at martin-graesslin.com does not
    designate permitted sender hosts) identity=mailfrom;
    client-ip=88.84.154.41; envelope-from=;
Received: from martin-apple.localnet (dslb-092-075-???-???.pools.arcor-ip.net [92.75.???.???])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mail.martin-graesslin.com (Postfix) with ESMTPSA id A770770B002A
    for ; Sun, 31 Jan 2010 23:39:55 +0100 (CET)
Content-Type: text/plain
To: pewit@w?????.com
From: Martin =?UTF-8?B?R3LDpMOfbGlu?=
In-Reply-To: <4B65FD23.8070405@w?????.com>
User-Agent: MailShake Akonadi Agent v0.0.1
Date: Sun, 31 Jan 2010 22:39:55 +0000
Message-Id: <[email protected]>
MIME-Version: 1.0
Subject: Mail-Shake protected email address
X-Mailshake-URL: http://mailhide.recaptcha.net/d?k=01RGTsT_i-cmQAHLotVus7dg==
    &c=cyTN9GSTsxchuNlJFAUHVo2fv6DyNQOO-UTwJ5FNJdg=

(Body suppressed)

--1264977597w?????.net18787--

A.3.2 Plain Text Mail

Received: by mail.martin-graesslin.com (Postfix, from userid 114)
    id 042A970B002A; Thu, 4 Feb 2010 10:16:14 +0100 (CET)
Received: from s541.e?????.de (s541.e?????.de [62.140.???.???])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mail.martin-graesslin.com (Postfix) with ESMTPS id 3B16070B0006
    for ; Thu, 4 Feb 2010 10:16:12 +0100 (CET)
Received: (qmail 10548 invoked for bounce); 4 Feb 2010 10:16:01 +0100
Date: 4 Feb 2010 10:16:01 +0100
From: [email protected]?????.de
To: [email protected]
Subject: failure notice
Message-Id: <[email protected]>
X-Length: 3478
X-UID: 442

Hi. This is the qmail-send program at s541.e?????.de. I’m afraid I wasn’t able to deliver your message to the following addresses. This is a permanent error; I’ve given up. Sorry it didn’t work out.

: This address no longer accepts mail.

--- Below this line is a copy of the message.

Return-Path:
Received: (qmail 10543 invoked from network); 4 Feb 2010 10:16:00 +0100
Received: from v32201.1blu.de (HELO mail.martin-graesslin.com) (88.84.154.41)
    by b?????.at with SMTP; 4 Feb 2010 10:16:00 +0100
Received: from martin-apple.localnet (dslb-092-075-???-???.pools.arcor-ip.net [92.75.???.???])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mail.martin-graesslin.com (Postfix) with ESMTPSA id B41EB70B0006
    for ; Thu, 4 Feb 2010 10:15:31 +0100 (CET)
Content-Type: text/plain
To: a?????@m?????.de
From: Martin =?UTF-8?B?R3LDpMOfbGlu?=
In-Reply-To: <000b01c1f47a$3f697d70$00426158@ufodundpmch>
User-Agent: MailShake Akonadi Agent v0.0.1
Date: Thu, 04 Feb 2010 09:15:30 +0000
Message-Id: <[email protected]>
MIME-Version: 1.0
Subject: Mail-Shake protected email address
X-Mailshake-URL: http://mailhide.recaptcha.net/d?k=01IHRKvW1O6U1ki7iUtOjLdw==
    &c=f_SZ08wEIDWSkMzBdhPxFmf_GNWNGQWs1CKQzFyDAKMdHkyTAMX7i_y9mho2dn97

You sent an email to a private Mail-Shake address but your address is not whitelisted. The email will not be delivered. You have to send an email to the public address. You can retrieve the public address by visiting the following web address and solving the shown CAPTCHA: http://mailhide.recaptcha.net/d?k=01IHRKvW1O6U1ki7iUtOjLdw==&c= f_SZ08wEIDWSkMzBdhPxFmf_GNWNGQWs1CKQzFyDAKMdHkyTAMX7i_y9mho2dn97

After sending an email to the public address you will receive a challenge email. This challenge contains an unique identifier. Please include this identifier in the subject of your original email and resent this email unchanged. It must be addressed to this private address.

There is no need to solve the CAPTCHA as stated in the email you will receive.

This message was generated automatically. Please do not reply.

A.4 Google Mail

Received: by mail.martin-graesslin.com (Postfix, from userid 114)
    id 4E3F970B002C; Thu, 11 Feb 2010 11:05:12 +0100 (CET)
Received: from mail-bw0-f163.google.com (mail-bw0-f163.google.com [209.85.218.163])
    by mail.martin-graesslin.com (Postfix) with ESMTP id 3A7D170B0006
    for ; Thu, 11 Feb 2010 11:05:03 +0100 (CET)
Received: by bwz3 with SMTP id 3so76417bwz.11
    for ; Thu, 11 Feb 2010 02:05:02 -0800 (PST)
Received: by 10.204.36.71 with SMTP id s7mr935917bkd.171.1265882702557;
    Thu, 11 Feb 2010 02:05:02 -0800 (PST)
MIME-Version: 1.0
Received: by 10.204.36.71 with SMTP id s7mr1108092bkd.171;
    Thu, 11 Feb 2010 02:05:02 -0800 (PST)
From: Mail Delivery Subsystem
To: [email protected]
Subject: Delivery Status Notification (Delay)
Message-ID: <[email protected]>
Date: Thu, 11 Feb 2010 10:05:02 +0000
Content-Type: text/plain; charset=ISO-8859-1
X-Length: 4119
X-UID: 75

This is an automatically generated Delivery Status Notification

THIS IS A WARNING MESSAGE ONLY.

YOU DO NOT NEED TO RESEND YOUR MESSAGE.

Delivery to the following recipient has been delayed:

[email protected]

Message will be retried for 2 more day(s)

----- Original message -----

Received: by 10.204.36.71 with SMTP id s7mr4756929bkd.171.1265794076529;
    Wed, 10 Feb 2010 01:27:56 -0800 (PST)
Return-Path:
Received: from mail.martin-graesslin.com (v32201.1blu.de [88.84.154.41])
    by mx.google.com with ESMTP id 19si2171974bwz.8.2010.02.10.01.27.55;
    Wed, 10 Feb 2010 01:27:56 -0800 (PST)
Received-SPF: neutral (google.com: 88.84.154.41 is neither permitted nor denied
    by best guess record for domain of [email protected])
    client-ip=88.84.154.41;
Authentication-Results: mx.google.com; spf=neutral (google.com: 88.84.154.41 is
    neither permitted nor denied by best guess record for domain of
    [email protected]) [email protected]
Received: from martin-apple.localnet (dslb-092-075-???-???.pools.arcor-ip.net [92.75.???.???])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mail.martin-graesslin.com (Postfix) with ESMTPSA id 3820B70B0008
    for ; Tue, 9 Feb 2010 08:42:04 +0100 (CET)
Content-Type: text/plain
To: [email protected]
From: Martin =?UTF-8?B?R3LDpMOfbGlu?=
In-Reply-To: <[email protected]>
User-Agent: MailShake Akonadi Agent v0.0.1
Date: Tue, 09 Feb 2010 07:42:02 +0000
Message-Id: <[email protected]>
MIME-Version: 1.0
Subject: Mail-Shake protected email address
X-Mailshake-URL: http://mailhide.recaptcha.net/d?k=01RGTsT_i-cmQAHLotVus7dg==
    &c=cyTN9GSTsxchuNlJFAUHVo2fv6DyNQOO-UTwJ5FNJdg=

You sent an email to a private Mail-Shake address but your address is not whitelisted. The email will not be delivered. You have to send an email to the public address. You can retrieve the public address by visiting the following web address and solving the shown CAPTCHA: http://mailhide.recaptcha.net/d?k=01RGTsT_i-cmQAHLotVus7dg==&c=cyTN9GSTsxchuNlJFAUHVo2fv6DyNQOO-UTwJ5FNJdg=

After sending an email to the public address you will receive a challenge email. This challenge contains an unique identifier. Please include this identifier in the subject of your original email and resent this email unchanged. It must be addressed to this private address.

There is no need to solve the CAPTCHA as stated in the email you will receive.

This message was generated automatically. Please do not reply.

B Mails from Automated Systems

B.1 Review Board

Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Subject: Re: Review Request: Updated Kwin desktop on-screen display animations
    to use QPropertyAnimation
From: =?utf-8?q?Martin_Gr=C3=A4=C3=9Flin?=
To: =?utf-8?q?Martin_Gr=C3=A4=C3=9Flin?=
Date: Sat, 06 Feb 2010 20:43:51 -0000
Message-ID: <20100206204351.2179.70516@localhost>
In-Reply-To: <20100206202526.2179.67673@localhost>
References: <20100206202526.2179.67673@localhost>
X-Authenticated-User: [email protected]
X-Authenticator: login
X-Invalid-HELO: HELO is no FQDN (contains no dot) (See RFC2821 4.1.1.1)
X-Exim-Version: 4.69 (build at 02-Feb-2008 04:50:35)
X-Date: 2010-02-06 21:44:01
X-Connected-IP: 127.0.0.1:53609
X-Message-Linecount: 133
X-Body-Linecount: 117
X-Message-Size: 3172
X-Body-Size: 2410
X-Received-Count: 1
X-Recipient-Count: 7
X-Local-Recipient-Count: 7
X-Local-Recipient-Defer-Count: 0
X-Local-Recipient-Fail-Count: 0
X-Length: 4824
X-UID: 78

B.2 Bugzilla

From: =?utf-8?q?Martin_Gr=C3=A4=C3=9Flin?=
Sender: [email protected]
To: [email protected]
Reply-To: [email protected]
List-Post:
Subject: [Bug 219802] Windows like logout/shutdown, the task switcher (Alt+)
    and some PopupApplets and Tooltips have artifacts (black corners)
    [setMask issue]
X-Bugzilla-URL: http://bugs.kde.org/
X-Bugzilla-Reason: CC
X-Bugzilla-Type: newchanged
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: plasma
X-Bugzilla-Component: general
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: [email protected]
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: NOR
X-Bugzilla-Assigned-To: [email protected]
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields: CC
In-Reply-To:
References:
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Message-Id: <[email protected]>
Date: Wed, 10 Feb 2010 21:40:01 +0100 (CET)
X-Length: 2591
X-UID: 588

C Mail-Shake API Documentation

C.1 MailShake Namespace Reference

Classes
• class DSN
• class EMail
• class Id
• class MailShake
  The main interface for MailShake.
• class DSNPrivate
• class EMailPrivate
• class IdPrivate
• class MailShakePrivate
• class WhiteListEntryPrivate
• class WhiteListEntry
  An entry in the whitelist.

Typedefs
• typedef boost::signal< void(const EMail &)> mailSignal
• typedef boost::signal< void(const Id *)> idSignal
• typedef boost::signal< void(WhiteListEntry *)> whiteListSignal

Enumerations
• enum MatchType { ExactMatch, CaseInsensitiveMatch, PartMatch, CaseInsensitivePartMatch, RegExMatch }

C.1.1 Detailed Description

Mail-Shake is a concept to prevent spam. It uses an approach similar to public-key cryptography: each user has two email addresses, a public one and a private one. Only emails sent to the private address are read by the user, but the sender first has to prove that he is a human and not a spam-sending bot. To do so, he sends an email to the public address. This email is answered with an automatic response containing a challenge and a unique identifier. Solving the challenge reveals the private email address. The sender can then resend the original email together with the unique identifier to the private address; if the identifier is correct, the sender has successfully authenticated himself and his address is added to a whitelist. For more details please see:

This library implements the Mail-Shake approach. It is able to decide whether an email has to be dropped or can be forwarded to the user. It can also generate the response emails from the values of a given configuration object. It is recommended to use the Mailhide API for the challenge. Nevertheless, it is possible to use any other CAPTCHA technology and include it as an attachment to the challenge email.

Receiving, sending and deleting emails is not handled inside this library. This has to be implemented in a client implementation which uses this library, so that the existing code of various clients can be reused. There can be, for example, a client implementation which uses Akonadi and the kdepimlibs, or an implementation as a Thunderbird extension. By not implementing receiving and sending of emails, the library can be kept small. Another advantage is that the client-side implementation can use existing technology to connect to various email resources such as IMAP or POP3 without each protocol being implemented inside the library.

The client implementation of Mail-Shake should also provide a configuration interface to manage the whitelist and Mail-Shake in general. This library does not provide a configuration interface. Nevertheless, there is MailShake::Config, which has to be used to pass the configured values to MailShake.
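The handshake described above can be sketched as a small decision routine. This is only an illustration of the concept, not the library's code: the names Decision, IncomingMail, Handshake, rememberChallenge and classify are hypothetical, the whitelist is a plain set of exact addresses, and the rule that an identifier can be used only once is an assumption.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical outcome of inspecting one incoming mail.
enum class Decision { Deliver, SendChallenge, Drop };

// Minimal stand-in for the fields the decision needs; this is not MailShake::EMail.
struct IncomingMail {
    std::string from;        // bare sender address
    std::string to;          // bare recipient address
    std::string responseId;  // identifier copied from a challenge, may be empty
};

class Handshake {
public:
    Handshake(std::string publicAddress, std::string privateAddress)
        : m_public(std::move(publicAddress)), m_private(std::move(privateAddress)) {}

    // Records an identifier that was sent out in a challenge mail.
    void rememberChallenge(const std::string &id) { m_pendingIds.insert(id); }

    bool isWhitelisted(const std::string &address) const {
        return m_whitelist.count(address) > 0;
    }

    Decision classify(const IncomingMail &mail) {
        if (mail.to == m_public) {
            return Decision::SendChallenge;  // first contact: answer with a challenge
        }
        if (mail.to != m_private) {
            return Decision::Drop;           // not addressed to this user at all
        }
        if (isWhitelisted(mail.from)) {
            return Decision::Deliver;        // sender authenticated earlier
        }
        const auto it = m_pendingIds.find(mail.responseId);
        if (!mail.responseId.empty() && it != m_pendingIds.end()) {
            m_pendingIds.erase(it);          // assumption: an identifier is valid once
            m_whitelist.insert(mail.from);   // sender is now authenticated
            return Decision::Deliver;
        }
        return Decision::Drop;               // unknown sender without a valid identifier
    }

private:
    std::string m_public;
    std::string m_private;
    std::set<std::string> m_whitelist;
    std::set<std::string> m_pendingIds;
};
```

A client would call classify for every received mail and carry out the actual sending, forwarding or deleting itself, in line with the division of work described above.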

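C.1.2 Typedef Documentation

C.1.2.1 typedef boost::signal< void(const Id *)> MailShake::idSignal

C.1.2.2 typedef boost::signal< void(const EMail &)> MailShake::mailSignal
The type for a signal taking one reference to an EMail.

C.1.2.3 typedef boost::signal< void(WhiteListEntry *)> MailShake::whiteListSignal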

C.1.3 Enumeration Type Documentation

C.1.3.1 enum MailShake::MatchType
The MatchType describes in what way the filter has to be applied to the tested data of a whitelist entry.

Enumerator:
ExactMatch The match has to be an exact match.
CaseInsensitiveMatch The match has to be an exact match ignoring case.
PartMatch The filter has to be part of the tested string.
CaseInsensitivePartMatch The filter has to be part of the tested string ignoring case.
RegExMatch The filter is used as a regular expression.
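The five match types could be applied to a tested string roughly as follows. This is a sketch under stated assumptions: the free function matches and the helper toLower are hypothetical and not part of the documented API, case folding is ASCII-only, and RegExMatch is interpreted here as a match of the whole string with std::regex, which the documentation does not specify.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Same enumerators as MailShake::MatchType, repeated to keep the sketch self-contained.
enum MatchType { ExactMatch, CaseInsensitiveMatch, PartMatch, CaseInsensitivePartMatch, RegExMatch };

// Lower-cases ASCII letters only, which fits the US-ASCII assumption of the library.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: applies a whitelist filter to the tested string.
bool matches(const std::string &filter, const std::string &tested, MatchType type) {
    switch (type) {
    case ExactMatch:
        return tested == filter;
    case CaseInsensitiveMatch:
        return toLower(tested) == toLower(filter);
    case PartMatch:
        return tested.find(filter) != std::string::npos;
    case CaseInsensitivePartMatch:
        return toLower(tested).find(toLower(filter)) != std::string::npos;
    case RegExMatch:
        // Assumption: the expression has to match the whole tested string.
        return std::regex_match(tested, std::regex(filter));
    }
    return false;
}
```

With PartMatch and a filter such as "@example.org", a single whitelist entry would cover every sender of one domain.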

C.2 MailShake::DSN Class Reference

#include

Public Member Functions
• EMail & originalEMail () const
• void setOriginalEMail (const EMail &mail)

C.2.1 Detailed Description
This class represents a Delivery Status Notification (DSN) for the scope of MailShake. It is a normal MailShake::EMail with the original email “attached”. This class is not a correct implementation of RFC 3464; it is written for use inside MailShake. It is the task of the client-side implementation to determine whether an email is a DSN. In that case it should pass a MailShake::DSN wherever a MailShake::EMail is required, and if the DSN contains the originally sent email, it should be set.

Author: Martin Gräßlin
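Passing a DSN wherever an EMail is required suggests a derived class that carries the original mail along. The following sketch shows one way code that only receives the base type could still reach the attached original. The types Mail and Notification, the flag hasOriginal and the function mailOfInterest are hypothetical stand-ins, not MailShake::EMail or MailShake::DSN, and the library may resolve this differently.

```cpp
#include <string>

// Stand-in for an email; polymorphic so that the dynamic type can be inspected.
struct Mail {
    virtual ~Mail() = default;
    std::string messageId;
};

// Stand-in for a delivery status notification with the original mail "attached".
struct Notification : Mail {
    Mail original;
    bool hasOriginal = false;

    void setOriginal(const Mail &mail) {
        original = mail;
        hasOriginal = true;
    }
};

// Returns the attached original for a notification that has one, else the mail itself.
const Mail &mailOfInterest(const Mail &mail) {
    const Notification *dsn = dynamic_cast<const Notification *>(&mail);
    if (dsn != nullptr && dsn->hasOriginal) {
        return dsn->original;
    }
    return mail;
}
```

The client stays responsible for recognising a DSN and for extracting the original message from the MIME structure before it fills such an object.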

C.2.2 Member Function Documentation

C.2.2.1 EMail & MailShake::DSN::originalEMail () const
Returns: The original email attached to the DSN.

C.2.2.2 void MailShake::DSN::setOriginalEMail (const EMail & mail)
Parameters: mail The original email attached to the DSN.

The documentation for this class was generated from the following files:
• dsn.h
• dsn.cpp

C.3 MailShake::DSNPrivate Class Reference

Public Attributes
• EMail originalEMail

The documentation for this class was generated from the following files:
• private/dsn_p.h
• private/dsn_p.cpp

C.4 MailShake::EMail Class Reference

#include

Public Member Functions
• EMail & operator= (const EMail &mail)
• bool operator== (const EMail &mail)
• string fromAddress () const
• void setFromAddress (const string &address)
• string replyTo () const
• void setReplyTo (const string &address)
• string sender () const
• void setSender (const string &address)
• string toAddress () const
• void setToAddress (const string &address)
• string messageId () const
• void setMessageId (const string &id)
• string inReplyTo () const
• void setInReplyTo (const string &id)
• string challengeId () const
• void setChallengeId (const string &uid)
• string challengeURL () const
• void setChallengeURL (const string &url)
• string challengeResponseId () const
• void setChallengeResponseId (const string &id)
• string subject () const
• void setSubject (const string &subject)
• void addHeader (const std::string &header, const std::string &datum)
• const std::list< std::string > * header (const std::string &name)
• const std::map< std::string, std::list< std::string > * > & headers () const
• EMail replyMail () const
• bool isMailShakeChallenge () const
• bool isMailShakeChallengeReply () const
• bool isSpam () const
• void setSpam (bool spam)
• bool isReplyToSpam () const
• void setReplyToSpam (bool spam)

C.4.1 Detailed Description
This class represents an email in MailShake. It does not represent a valid implementation of RFC 822 mails. The class structure is based on what is required to decide whether a mail has to be dropped or to generate a response mail. The email implementation expects the data to be US-ASCII. This class does not implement MIME. If a received email uses MIME content which is required in MailShake (e.g. a Delivery Status Notification), the client implementation has to present the EMail in a way that it can be used by MailShake. The same applies to mails passed to the client for sending: the client has to produce a valid RFC 822 email from the given data.

Each address in this class has to be just the address, without the optional name and without < and >. E.g. it has to be “[email protected]” instead of “Foobar ”. This implies that the addresses are not mailboxes. Each address field contains only one address, not several. That is because MailShake requires only one address: the original sender, which will become the TO address in the reply. Other addresses like CC and BCC or even TO are not relevant and are therefore not implemented.

Author: Martin Gräßlin
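Since EMail expects bare addresses, a client has to strip the display name and the angle brackets before it passes an address on. A minimal sketch of such a step is shown below. The helper bareAddress is hypothetical and not part of the library, the example addresses are placeholders, and the sketch deliberately ignores quoted display names, comments and address groups, which a full RFC 822 parser would have to handle.

```cpp
#include <string>

// Hypothetical client-side helper: reduces a mailbox such as
// "Foobar <foo@example.org>" to the bare address "foo@example.org".
std::string bareAddress(const std::string &mailbox) {
    const std::string::size_type open = mailbox.rfind('<');
    const std::string::size_type close = mailbox.rfind('>');

    std::string address = mailbox;
    if (open != std::string::npos && close != std::string::npos && open < close) {
        // Keep only the text between the last '<' and the last '>'.
        address = mailbox.substr(open + 1, close - open - 1);
    }

    // Trim surrounding whitespace.
    const char *whitespace = " \t\r\n";
    const std::string::size_type first = address.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return std::string();
    }
    const std::string::size_type last = address.find_last_not_of(whitespace);
    return address.substr(first, last - first + 1);
}
```

A client would apply this to the From, Reply-To and Sender fields before calling the corresponding setters.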

C.4.2 Member Function Documentation

C.4.2.1 void MailShake::EMail::addHeader (const std::string & header, const std::string & datum)
Adds the header and appends the given datum if the header already exists.

Parameters:
header The header to add
datum The single datum of the header to be appended

C.4.2.2 string MailShake::EMail::challengeId () const Returns: e unique id of this MailShake allenge.

C.4.2.3 string MailShake::EMail::challengeResponseId () const Returns: e response Id stored in a X-Mailshake-Response-ID header or in the subject.

C.4.2.4 string MailShake::EMail::challengeURL () const Returns: e URL for the MailShake allenge.

C.4.2.5 string MailShake::EMail::fromAddress () const Returns: the from address. An empty string may be returned.

C.4.2.6 const std::list< string > ∗ MailShake::EMail::header (const std::string & name) Parameters: name e name of the header

Returns: a list of all data elements for given header. If there is no header null is returned.

C.4.2.7 const std::map< std::string, std::list< std::string > ∗ > & MailShake::EMail::headers () const Returns: the headers

C.4.2.8 string MailShake::EMail::inReplyTo () const Returns: the message id the mail is in reply to.

C.4.2.9 bool MailShake::EMail::isMailShakeChallenge () const Returns: true if the email is a MailShake Challenge email, that is, if the email has an X-Mailshake-ID header.

C.4.2.10 bool MailShake::EMail::isMailShakeChallengeReply () const Returns: true if the email is a reply to a MailShake Challenge email. It tests for an X-Mailshake-Response-ID header or whether the subject contains a possible ID.

C.4.2.11 bool MailShake::EMail::isReplyToSpam () const Returns: true if the EMail is a reply to another EMail which is classified as Spam.

C.4.2.12 bool MailShake::EMail::isSpam () const Returns: true if the EMail has been classified as Spam.

C.4.2.13 string MailShake::EMail::messageId () const Returns: The message Id. An empty string may be returned.

C.4.2.14 EMail MailShake::EMail::replyMail () const Returns: A reply mail to this mail. It will set the To address and the In-Reply-To header. The To address is determined following the recommendations in section 4.4.4 of RFC 822.
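Section 4.4.4 of RFC 822 recommends that a reply goes to the Reply-To address if that field exists and to the From address otherwise; the Sender field is not to be used automatically for replies. A minimal sketch of that selection is given below. The free function replyTarget is an illustrative assumption; the actual replyMail implementation may differ in detail.

```cpp
#include <string>

// Hypothetical helper illustrating RFC 822, section 4.4.4.
// Both arguments are bare addresses; an empty string means "field not present".
std::string replyTarget(const std::string &replyTo, const std::string &from)
{
    // If a Reply-To field exists, the reply goes there and not to From.
    if (!replyTo.empty()) {
        return replyTo;
    }
    // Otherwise the reply goes to the From address. The Sender field is
    // deliberately not consulted.
    return from;
}
```

With the accessors documented above, a caller would pass replyTo() and fromAddress() of the received mail.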

C.4.2.15 string MailShake::EMail::replyTo () const Returns: the reply to address. An empty string may be returned.

C.4.2.16 string MailShake::EMail::sender () const Returns: the sender address. An empty string may be returned.

C.4.2.17 void MailShake::EMail::setChallengeId (const string & uid) Parameters: uid The unique id of this MailShake challenge.

C.4.2.18 void MailShake::EMail::setChallengeResponseId (const string & id) Parameters: id The response Id from an X-Mailshake-Response-ID header.

C.4.2.19 void MailShake::EMail::setChallengeURL (const string & url) Parameters: url The URL for the MailShake challenge.

C.4.2.20 void MailShake::EMail::setFromAddress (const string & address) Parameters: address The from address.

C.4.2.21 void MailShake::EMail::setInReplyTo (const string & id) Parameters: id The message id the mail is in reply to.

C.4.2.22 void MailShake::EMail::setMessageId (const string & id) Parameters: id The id of this message.

C.4.2.23 void MailShake::EMail::setReplyTo (const string & address) Parameters: address The reply to address.

C.4.2.24 void MailShake::EMail::setReplyToSpam (bool spam) Parameters: spam This EMail is a reply to a mail classified as Spam.

C.4.2.25 void MailShake::EMail::setSender (const string & address) Parameters: address The sender address.

C.4.2.26 void MailShake::EMail::setSpam (bool spam) Parameters: spam Classifies this mail as Spam

C.4.2.27 void MailShake::EMail::setSubject (const string & subject) Parameters: subject The subject

C.4.2.28 void MailShake::EMail::setToAddress (const string & address) Parameters: address The TO address.

C.4.2.29 string MailShake::EMail::subject () const Returns: The subject.

C.4.2.30 string MailShake::EMail::toAddress () const Returns: the TO address. An empty string may be returned.

The documentation for this class was generated from the following files:

• email.h
• email.cpp

C.5 MailShake::EMailPrivate Class Reference

Public Attributes
• std::string toAddress
• std::string fromAddress
• std::string senderAddress
• std::string replyToAddress
• std::string messageID
• std::string inReplyTo
• std::string challengeId
• std::string challengeUrl
• std::string challengeResponseId
• std::string subject
• bool spam
• bool replyToSpam
• std::map< std::string, std::list< std::string > * > headers

The documentation for this class was generated from the following files:

• private/email_p.h
• private/email_p.cpp

C.6 MailShake::Id Class Reference

#include <id.h>

Public Member Functions
• Id (const std::string &id, const std::string &address, const boost::posix_time::ptime date)
• std::string uniqueId () const
• void setUniqueId (const std::string &uid)
• std::string address () const
• void setAddress (const std::string &address)
• boost::posix_time::ptime date () const
• void setDate (const boost::posix_time::ptime &date)

C.6.1 Detailed Description

This class represents one unique Id as used by MailShake. One Id consists of the unique key, the associated email address (from) and the date when the Id was created.

Author: Martin Gräßlin

C.6.2 Member Function Documentation

C.6.2.1 std::string MailShake::Id::address () const Returns: The email address associated with this Id

C.6.2.2 boost::posix_time::ptime MailShake::Id::date () const Returns: The date of this Id

C.6.2.3 void MailShake::Id::setAddress (const std::string & address) Parameters: address The email address to be associated with this Id

C.6.2.4 void MailShake::Id::setDate (const boost::posix_time::ptime & date) Parameters: date The date to be set for this Id

C.6.2.5 void MailShake::Id::setUniqueId (const std::string & uid) Parameters: uid The unique identifier of this Id

C.6.2.6 std::string MailShake::Id::uniqueId () const Returns: The unique identifier of this Id

The documentation for this class was generated from the following files:

• id.h
• id.cpp

C.7 MailShake::IdPrivate Class Reference

Public Attributes
• std::string address
• std::string id
• boost::posix_time::ptime date

The documentation for this class was generated from the following files:

• private/id_p.h
• private/id_p.cpp

C.8 MailShake::MailShake Class Reference

The main interface for MailShake.

#include <mailshake.h>

Public Member Functions
• bool privateMailReceived (const EMail &mail)
• bool testPrivateMailReceived (const EMail &mail) const
• void publicMailReceived (const EMail &mail)
• void privateMailSent (const EMail &mail)
• void addWhiteListEntry (WhiteListEntry *entry)
• void addWhiteListEntries (std::list< WhiteListEntry * > entries)
• const std::map< std::string, Id * > & idStorage () const
• void setIdStorage (const std::map< std::string, Id * > &storage)
• std::list< WhiteListEntry * > temporaryWhitelist () const
• void setTemporaryWhitelist (const std::list< WhiteListEntry * > &whitelist)
• std::list< WhiteListEntry * > clearTemporaryWhitelist ()
• std::list< WhiteListEntry * > whitelist () const
• void setWhitelist (const std::list< WhiteListEntry * > &whitelist)
• void signalChallengeEmail (const mailSignal::slot_type &slot)
• void signalNotifyMailshake (const mailSignal::slot_type &slot)
• void signalNotifyIncorrectId (const mailSignal::slot_type &slot)
• void signalNotifyIncorrectAddress (const mailSignal::slot_type &slot)
• void signalIdAdded (const idSignal::slot_type &slot)
• void signalIdRemoved (const idSignal::slot_type &slot)
• void signalWhiteListEntryAdded (const whiteListSignal::slot_type &slot)
• void signalTemporaryWhiteListEntryAdded (const whiteListSignal::slot_type &slot)
• void signalTemporaryWhiteListEntryRemoved (const whiteListSignal::slot_type &slot)

C.8.1 Detailed Description

The main interface for MailShake. This class is the interface for the MailShake approach. It handles received emails and decides whether an email has to be dropped and whether a reply has to be sent.

Author: Martin Gräßlin

C.8.2 Member Function Documentation

C.8.2.1 void MailShake::MailShake::addWhiteListEntries (std::list< WhiteListEntry * > entries) Convenience method to add a list of whitelist entries to the permanent whitelist.

Parameters: entries The list of whitelist entries to be added

See also: addWhiteListEntry

C.8.2.2 void MailShake::MailShake::addWhiteListEntry (WhiteListEntry * entry) Adds a new entry to the permanent whitelist.

Parameters: entry The whitelist entry which has to be added

See also: addWhiteListEntries

C.8.2.3 std::list< WhiteListEntry * > MailShake::MailShake::clearTemporaryWhitelist () This method removes all elements from the temporary whitelist. The signal temporaryWhitelistRemoved will not be emitted and existing entries are not deleted. It is the responsibility of the caller to clean up all existing pointers.

Returns: the previous temporary whitelist

C.8.2.4 const map< string, Id * > & MailShake::MailShake::idStorage () const Returns: The map used to store the ids. The key is the unique id, the value is a MailShake::Id object containing id, address and date. It is the task of the client side implementation to store this map persistently.

C.8.2.5 bool MailShake::MailShake::privateMailReceived (const EMail & mail) This method has to be invoked whenever an email arrives at the private address. It alone decides whether the email has to be dropped. The client implementation has to take care of dropping the email or forwarding it to the user. It is recommended that the client implementation adds a header X-Mailshake-Whitelisted to indicate whether the email has to be dropped.

Parameters: mail The email received by the client implementation for the private email address.

Returns: true if the email can be forwarded to the user, false if it has to be dropped.
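The division of labour described for privateMailReceived can be sketched from the client's point of view as follows. The types Mail and Delivery, the function handlePrivateMail and the header values "yes" and "no" are assumptions made for this illustration; the callable decide merely stands in for MailShake::privateMailReceived.

```cpp
#include <functional>
#include <map>
#include <string>

// Stand-in for MailShake::EMail, reduced to what this sketch needs.
struct Mail {
    std::string from;
    std::map<std::string, std::string> headers;
};

enum class Delivery { Forward, Drop };

// 'decide' plays the role of MailShake::privateMailReceived: it returns true
// if the mail may be forwarded to the user and false if it has to be dropped.
Delivery handlePrivateMail(Mail &mail, const std::function<bool(const Mail &)> &decide)
{
    const bool whitelisted = decide(mail);
    // Recommended marker header; the values "yes"/"no" are an assumption.
    mail.headers["X-Mailshake-Whitelisted"] = whitelisted ? "yes" : "no";
    // Acting on the decision is the client's job, not the library's.
    return whitelisted ? Delivery::Forward : Delivery::Drop;
}
```

Keeping the decision and the delivery action separate mirrors the documented contract: the library only returns a verdict and the client performs the dropping or forwarding.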

C.8.2.6 void MailShake::MailShake::privateMailSent (const EMail & mail) This method has to be invoked whenever the user of the private email address sends an email. This method will add the receiver to the temporary or permanent whitelist. IMPORTANT: If this method is not called, Mail-Shake will not work correctly, as email responses to the private email address will be dropped.

Parameters: mail The email sent by the user

C.8.2.7 void MailShake::MailShake::publicMailReceived (const EMail & mail) This method has to be invoked whenever an email arrives at the public address. It will decide whether a Mail-Shake challenge response email will be sent or not. To this end, this method will emit the signalChallengeEmail. The client implementation has to connect to this signal. The client implementation has to take care of dropping the email after it has been processed by this method.

Parameters: mail The email received by the client implementation for the public email address

C.8.2.8 void MailShake::MailShake::setIdStorage (const std::map< std::string, Id * > & storage) Parameters: storage A new map of unique ids to be used by MailShake. As loading the Id storage is done by the client side implementation, this method has to be invoked when a MailShake process is started.

C.8.2.9 void MailShake::MailShake::setTemporaryWhitelist (const std::list< WhiteListEntry * > & whitelist) This method has to be invoked by the client side implementation when a MailShake process is started. All items are added to the whitelist. If there is an existing entry for the same filter, that element will be replaced by the new one. Other existing elements will not be removed. If you want to remove all elements from the temporary whitelist, you should use clearTemporaryWhitelist.

Parameters: whitelist The temporary whitelist to be used by MailShake.

C.8.2.10 void MailShake::MailShake::setWhitelist (const std::list< WhiteListEntry * > & whitelist) Parameters: whitelist The whitelist to be used by MailShake. This method has to be invoked by the client side implementation when a MailShake process is started.

C.8.2.11 void MailShake::MailShake::signalChallengeEmail (const mailSignal::slot_type & slot) The signal for sending a challenge email. This signal is emitted when an email has been received by publicMailReceived. The client implementation connected to this signal has to send the given email and set the additional data such as sender address, subject, challenge URL and the actual text. A client implementation has to connect to this signal. The slot takes one parameter: const MailShake::EMail&

Parameters: slot The slot on client side

C.8.2.12 void MailShake::MailShake::signalIdAdded (const idSignal::slot_type & slot) This signal is emitted when a new Id has been added to the storage. The client which connects to this signal has to store the Id in a persistent way.

Parameters: slot The slot on client side.

C.8.2.13 void MailShake::MailShake::signalIdRemoved (const idSignal::slot_type & slot) This signal is emitted when an Id has been removed from the storage. The client which connects to this signal has to remove the Id from the persistent storage.

Parameters: slot The slot on client side.

C.8.2.14 void MailShake::MailShake::signalNotifyIncorrectAddress (const mailSignal::slot_type & slot) The signal for sending a notification of an incorrect sender address. This signal is emitted when an email has been received by privateMailReceived and contains a valid MailShake Id, but has not been sent with the same from address as the email to the public address. The client side implementation has to take care of sending the email and of setting the sender, subject and actual email data. A client implementation has to connect to this signal.

Parameters: slot The slot on client side

C.8.2.15 void MailShake::MailShake::signalNotifyIncorrectId (const mailSignal::slot_type & slot) The signal for sending a notification of an incorrect Id. This signal is emitted when an email has been received by privateMailReceived and contains a MailShake Id which is incorrect. The client implementation connecting to this signal has to prepare and send the given email; that includes setting the sender address, subject and text. A client implementation has to connect to this signal. The slot takes one parameter: const MailShake::EMail&

Parameters: slot The slot on client side

C.8.2.16 void MailShake::MailShake::signalNotifyMailshake (const mailSignal::slot_type & slot) The signal for sending a notification that an email will be dropped. This signal is emitted when an email is received by privateMailReceived and is not authenticated. It is up to the client to decide whether it should connect to this signal and send the notification. If it sends the notification, it should set the sender address, subject and text. The slot takes one parameter: const MailShake::EMail&

Parameters: slot The slot on client side

C.8.2.17 void MailShake::MailShake::signalTemporaryWhiteListEntryAdded (const whiteListSignal::slot_type & slot) This signal is emitted when a WhiteListEntry has been added to the temporary whitelist. The client which connects to this signal has to store the WhiteListEntry in a persistent way.

Parameters: slot The slot on client side.

C.8.2.18 void MailShake::MailShake::signalTemporaryWhiteListEntryRemoved (const whiteListSignal::slot_type & slot) This signal is emitted when a WhiteListEntry has been removed from the temporary whitelist. The client which connects to this signal has to remove the WhiteListEntry from the persistent storage.

Parameters: slot The slot on client side.

C.8.2.19 void MailShake::MailShake::signalWhiteListEntryAdded (const whiteListSignal::slot_type & slot) This signal is emitted when a WhiteListEntry has been added to the whitelist. The client which connects to this signal has to store the WhiteListEntry in a persistent way.

Parameters: slot The slot on client side.

C.8.2.20 std::list< WhiteListEntry * > MailShake::MailShake::temporaryWhitelist () const Returns: The temporary whitelist. It is the task of the client side implementation to store this whitelist persistently.

C.8.2.21 bool MailShake::MailShake::testPrivateMailReceived (const EMail & mail) const Performs the same tests as privateMailReceived and returns the same result, but does not modify the whitelist or the id storage and does not emit signals. It is a convenience method to test whether a received mail is whitelisted.

Parameters: mail The mail to check

Returns: true if the mail would be whitelisted in privateMailReceived, false otherwise

See also: privateMailReceived

C.8.2.22 std::list< WhiteListEntry * > MailShake::MailShake::whitelist () const Returns: The whitelist. It is the task of the client side implementation to store this whitelist persistently.

The documentation for this class was generated from the following files:

• mailshake.h
• mailshake.cpp

C.9 MailShake::MailShakePrivate Class Reference

Public Member Functions
• std::string nextId ()
• WhiteListEntry * entry (const std::list< WhiteListEntry * > &list, const std::string &data, const std::string &header=std::string()) const
• bool privateMailOnWhitelist (const EMail &mail)
• bool isOnTemporaryWhitelist (const std::string &address)
• void addTemporaryWhitelistEntry (WhiteListEntry *entry)
• WhiteListEntry * removeTemporaryWhiteListEntry (const std::string &address)
• void moveOrAddEntryToPermanent (const std::string &address)

Public Attributes
• boost::mt19937 rng
• boost::uniform_int<> distribution
• boost::variate_generator< boost::mt19937 &, boost::uniform_int<> > generator
• std::map< std::string, Id * > idStorage
• std::map< std::string, std::list< WhiteListEntry * > > whitelist
• std::map< std::string, WhiteListEntry * > temporaryAddressWhiteList
• mailSignal challenge
• mailSignal notifyId
• mailSignal notifyAddress
• mailSignal notifyMailshake
• idSignal idAdded
• idSignal idRemoved
• whiteListSignal whitelistAdded
• whiteListSignal temporaryWhitelistAdded
• whiteListSignal temporaryWhitelistRemoved

C.9.1 Member Function Documentation

C.9.1.1 void MailShake::MailShakePrivate::addTemporaryWhitelistEntry (WhiteListEntry * entry) Adds the given WhiteListEntry to the temporary whitelist and replaces any existing entry for the address used in the given WhiteListEntry. It will fire the signal temporaryWhitelistRemoved.

Parameters: entry The WhiteListEntry to be added to the temporary whitelist

C.9.1.2 WhiteListEntry * MailShake::MailShakePrivate::entry (const std::list< WhiteListEntry * > & list, const std::string & data, const std::string & header = std::string()) const

Parameters:
  list    The whitelist to search for the data
  data    The data for which the first matching entry is looked up
  header  The header which should be used for filtering. If empty, the address data will be tested

Returns: The WhiteListEntry which matches the given data, NULL if there is no matching entry.

C.9.1.3 bool MailShake::MailShakePrivate::isOnTemporaryWhitelist (const std::string & address) Returns: true if the given address matches, in a case insensitive way, a whitelist entry on the temporary whitelist.

C.9.1.4 void MailShake::MailShakePrivate::moveOrAddEntryToPermanent (const std::string & address) Removes the WhiteListEntry for the given address from the temporary whitelists and adds it to the permanent whitelist. If the temporary whitelist does not contain an entry a new one will be added to the permanent whitelist. If the permanent whitelist already contains a WhiteListEntry for the given address, no new one will be added, but the one from the temporary whitelist is removed nevertheless.

Parameters: address The address for which the WhiteListEntry has to be moved or a new one added
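The three cases described for moveOrAddEntryToPermanent (move, add, or only remove the temporary entry) can be sketched as follows. The types Entry and Whitelist and the function moveOrAddToPermanent are simplified stand-ins that hold values keyed by an already lower-cased address; the library itself works with WhiteListEntry pointers and emits signals, which this sketch leaves out.

```cpp
#include <map>
#include <string>

// Stand-in for WhiteListEntry.
struct Entry {
    std::string address;
};

// Key: lower-cased address.
typedef std::map<std::string, Entry> Whitelist;

void moveOrAddToPermanent(Whitelist &temporary, Whitelist &permanent,
                          const std::string &address)
{
    const Whitelist::iterator temp = temporary.find(address);
    const bool alreadyPermanent = permanent.count(address) > 0;

    if (!alreadyPermanent) {
        if (temp != temporary.end()) {
            // Case 1: move the temporary entry to the permanent whitelist.
            permanent[address] = temp->second;
        } else {
            // Case 2: no temporary entry exists, so a new one is created.
            permanent[address] = Entry{address};
        }
    }
    // Case 3 and clean-up: the temporary entry is removed in any case.
    if (temp != temporary.end()) {
        temporary.erase(temp);
    }
}
```

The unconditional removal at the end keeps the two lists disjoint for an address once it has been made permanent.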

C.9.1.5 bool MailShake::MailShakePrivate::privateMailOnWhitelist (const EMail & mail) Checks if the given mail is on the permanent whitelist. Notes on complexity:

• A match for the address is O(n), with n being the number of elements in the address whitelist.

• The complexity of a header match is O(n + m * log(l) * k), with n the number of elements in the address whitelist, m the number of headers in the mail, l the number of different headers in the whitelist and k the number of elements in one header whitelist. We can assume that l << m and k << m, as there are many more headers for which no whitelist entry exists than headers for which a whitelist entry exists. In general only a few generic entries will be added to the whitelist. The dominating factor of this algorithm is m * log(l). So we can say that the complexity in general is O(n log(n)).

Parameters: mail The mail to check

Returns: true if the mail is whitelisted, false otherwise.
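The terms of the complexity estimate can be related to code with the following sketch. It assumes the whitelist layout documented for MailShakePrivate::whitelist (lower-case header name as key, the empty key for address filters) but simplifies the entries to plain strings that are compared exactly; the real entries carry a match type. The names Whitelist and onWhitelist are hypothetical.

```cpp
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Key: lower-case header name; the empty key holds the address filters.
typedef std::map<std::string, std::list<std::string> > Whitelist;

bool onWhitelist(const Whitelist &whitelist, const std::string &fromAddress,
                 const std::vector<std::pair<std::string, std::string> > &mailHeaders)
{
    // Address match: linear scan over the n address filters.
    const Whitelist::const_iterator addresses = whitelist.find("");
    if (addresses != whitelist.end()) {
        for (const std::string &filter : addresses->second) {
            if (filter == fromAddress) {
                return true;
            }
        }
    }
    // Header match: m map lookups of cost log(l), each followed by a scan
    // over the k filters stored for that header.
    for (const std::pair<std::string, std::string> &header : mailHeaders) {
        if (header.first.empty()) {
            continue; // the empty key is reserved for address filters
        }
        const Whitelist::const_iterator filters = whitelist.find(header.first);
        if (filters == whitelist.end()) {
            continue;
        }
        for (const std::string &filter : filters->second) {
            if (filter == header.second) {
                return true;
            }
        }
    }
    return false;
}
```

Most headers of a typical mail have no whitelist entry, so for them only the map lookup is paid and the inner scan is skipped.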

C.9.1.6 WhiteListEntry ∗ MailShake::MailShakePrivate::removeTemporaryWhiteListEntry (const std::string & address) Removes from the temporary whitelist and returns the WhiteListEntry for the given address. It will fire the signal temporaryWhitelistRemoved. It’s the responsibility of the caller to delete the entry.

Parameters: address The address for which the corresponding whitelist entry should be removed.

Returns: The removed WhiteListEntry or NULL if the temporary whitelist did not contain the entry.

C.9.2 Member Data Documentation

C.9.2.1 std::map< std::string, WhiteListEntry * > MailShake::MailShakePrivate::temporaryAddressWhiteList The temporary whitelist is only used for address matching in a case insensitive way. There is actually no need to have more than one entry on the temporary whitelist for each address. So when a new entry is added for an existing string, the old one will be removed.

C.9.2.2 std::map< std::string, std::list< WhiteListEntry * > > MailShake::MailShakePrivate::whitelist The whitelist is a map consisting of the lowercase header name as the key and a list of WhiteListEntries as the value. The empty string as key is used for address matching values.

The documentation for this class was generated from the following files:

• private/mailshake_p.h
• private/mailshake_p.cpp

C.10 MailShake::WhiteListEntry Class Reference

An entry in the whitelist.

#include <whitelistentry.h>

Public Member Functions
• boost::posix_time::ptime date () const
• void setDate (const boost::posix_time::ptime &date)
• std::string header () const
• void setHeader (const std::string &header)
• bool isAddressFiltering () const
• void enableAddressFiltering (bool enable) const
• std::string filter () const
• void setFilter (const std::string &filter)
• MatchType matchType () const
• void setMatchType (MatchType match)
• bool matches (const std::string &data)
• void setUId (uint uid)
• uint uId () const

C.10.1 Detailed Description

An entry in the whitelist. This class represents an entry in the whitelist. It consists of the date, the header or address to filter, the filter itself and how the filter has to be used.

Author: Martin Gräßlin

C.10.2 Member Function Documentation

C.10.2.1 boost::posix_time::ptime MailShake::WhiteListEntry::date () const Returns: The date when the entry was added to the whitelist.

C.10.2.2 void MailShake::WhiteListEntry::enableAddressFiltering (bool enable) const Parameters: enable Enables address filtering.

C.10.2.3 std::string MailShake::WhiteListEntry::filter () const Returns: The filter which is applied to address or header data.

C.10.2.4 std::string MailShake::WhiteListEntry::header () const Returns: The name of the header for which this entry is used. If the entry is for an address, an empty string will be returned.

See also: isAddressFiltering

C.10.2.5 bool MailShake::WhiteListEntry::isAddressFiltering () const Returns: true if this entry is applied to the sender addresses, false if it is applied to generic headers.

C.10.2.6 bool MailShake::WhiteListEntry::matches (const std::string & data) Parameters: data The data to be tested

Returns: true if the given data matches the filter.

C.10.2.7 MatchType MailShake::WhiteListEntry::matchType () const Returns: The MatchType to be used by this whitelist entry.

C.10.2.8 void MailShake::WhiteListEntry::setDate (const boost::posix_time::ptime & date) Parameters: date The date when the entry was added to the whitelist.

C.10.2.9 void MailShake::WhiteListEntry::setFilter (const std::string & filter) Parameters: filter The filter to be applied to address or header data.

C.10.2.10 void MailShake::WhiteListEntry::setHeader (const std::string & header) Parameters: header The name of the header this entry should filter. Please disable address filtering when setting a header.

See also: enableAddressFiltering

C.10.2.11 void MailShake::WhiteListEntry::setMatchType (MatchType match) Parameters: match The MatchType to be used by this whitelist entry.

C.10.2.12 void MailShake::WhiteListEntry::setUId (uint uid) Sets an optional unique id for this WhiteListEntry. This can be used by the client implementation to identify an entry in the external storage. The id is set to 0 by default.

Parameters: uid The unique identifier for this WhiteListEntry as set by the client implementation.

C.10.2.13 uint MailShake::WhiteListEntry::uId () const Returns: The unique identifier of this WhiteListEntry as set by the client implementation. In case the client implementation has not set a uid, 0 will be returned.

The documentation for this class was generated from the following files:

• whitelistentry.h
• whitelistentry.cpp

C.11 MailShake::WhiteListEntryPrivate Class Reference

Public Attributes
• boost::posix_time::ptime date
• std::string header
• std::string filter
• bool addressFiltering
• MatchType match
• uint uid

The documentation for this class was generated from the following files:

• private/whitelistentry_p.h
• private/whitelistentry_p.cpp

D Mailman Archive Address Harvester

D.1 main.cpp

    #include "mailmanharvester.h"
    #include <kapplication.h>
    #include <kaboutdata.h>
    #include <kcmdlineargs.h>
    #include <klocale.h>

    static const char description[] =
        I18N_NOOP("A KDE 4 Application");

    static const char version[] = "%{VERSION}";

    int main(int argc, char **argv)
    {
        KAboutData about("mailmanharvester", 0, ki18n("MailmanHarvester"), version, ki18n(description),
                         KAboutData::License_GPL, ki18n("(C) 2009 Martin Graesslin"),
                         KLocalizedString(), 0, "[email protected]");
        about.addAuthor( ki18n("Martin Graesslin"), KLocalizedString(), "[email protected]" );
        KCmdLineArgs::init(argc, argv, &about);

        KCmdLineOptions options;
        options.add("+[URL]", ki18n( "Document to open" ));
        KCmdLineArgs::addCmdLineOptions(options);
        KApplication app;

        MailmanHarvester *widget = new MailmanHarvester;

        // see if we are starting with session management
        if (app.isSessionRestored())
        {
            RESTORE(MailmanHarvester);
        }
        else
        {
            // no session.. just start up normally
            KCmdLineArgs *args = KCmdLineArgs::parsedArgs();
            if (args->count() == 0)
            {
                //mailmanharvester *widget = new mailmanharvester;
                widget->show();
            }
            else
            {
                int i = 0;
                for (; i < args->count(); i++)
                {
                    //mailmanharvester *widget = new mailmanharvester;
                    widget->show();
                }
            }
            args->clear();
        }

        return app.exec();
    }

D.2 mailmanharvester.h

    #ifndef MAILMANHARVESTER_H
    #define MAILMANHARVESTER_H

    #include <kxmlguiwindow.h>

    #include "ui_prefs_base.h"
    #include <QQueue>

    class QWebPage;
    class MailmanHarvesterView;

    class MailmanHarvester : public KXmlGuiWindow
    {
        Q_OBJECT
    public:
        MailmanHarvester();
        virtual ~MailmanHarvester();

    private slots:
        void execute(QString address);
        void executeArchive(QString address);
        void slotLoadFinished(bool ok);
        void slotThreadLoadFinished(bool ok);

    private:
        void setupActions();

    private:
        MailmanHarvesterView *m_view;
        QWebPage *m_page;
        QWebPage *m_threadPage;
        QQueue<QString> m_addressQueue;
    };

    #endif // MAILMANHARVESTER_H

D.3 mailmanharvester.cpp

    #include "mailmanharvester.h"
    #include "mailmanharvesterview.h"

    #include <kactioncollection.h>
    #include <kstandardaction.h>
    #include <kurl.h>

    #include <QWebElement>
    #include <QWebFrame>
    #include <QWebPage>

    MailmanHarvester::MailmanHarvester()
        : KXmlGuiWindow()
        , m_view(new MailmanHarvesterView(this))
        , m_page(new QWebPage())
        , m_threadPage(new QWebPage())
    {
        setCentralWidget(m_view);
        setupActions();
        setupGUI();
        connect(m_view, SIGNAL(execute(QString)), this, SLOT(execute(QString)));
        connect(m_view, SIGNAL(executeArchive(QString)), this, SLOT(executeArchive(QString)));
        connect(m_page, SIGNAL(loadFinished(bool)), this, SLOT(slotLoadFinished(bool)));
        connect(m_threadPage, SIGNAL(loadFinished(bool)), this, SLOT(slotThreadLoadFinished(bool)));
    }

    MailmanHarvester::~MailmanHarvester()
    {
    }

    void MailmanHarvester::setupActions()
    {
        KStandardAction::quit(qApp, SLOT(closeAllWindows()), actionCollection());
    }

    void MailmanHarvester::executeArchive(QString address)
    {
        m_threadPage->mainFrame()->load(QUrl(address));
    }

    void MailmanHarvester::execute(QString address)
    {
        m_addressQueue.enqueue(address);
        KUrl url(address);
        if (m_addressQueue.size() == 1){
            m_page->mainFrame()->load(url);
            m_view->setBusy(true);
        }
    }

    void MailmanHarvester::slotLoadFinished(bool ok)
    {
        if (!ok) {
            m_addressQueue.dequeue();
            return;
        }
        QWebElement doc = m_page->mainFrame()->documentElement();
        foreach (const QWebElement &element, doc.findAll("a")) {
            if (element.attribute("href").contains("mailto")) {
                // pipermail
                QString text = element.toPlainText();
                QStringList elements = text.split(" at ");
                if (elements.count() == 2){
                    QString address = elements[0] + "@" + elements[1];
                    m_view->addAddress(address);
                }
            } else {
                // marc mail - no mailto
                QRegExp rx(".*<(.*\\s\\(\\) .*)>");
                if (rx.exactMatch(element.toPlainText())) {
                    QString address = rx.capturedTexts()[1];
                    address.replace(" () ", "@");
                    address.replace(" ! ", ".");
                    m_view->addAddress(address);
                }
            }
        }
        m_addressQueue.dequeue();
        if (!m_addressQueue.isEmpty()) {
            m_page->mainFrame()->load(QUrl(m_addressQueue.head()));
        } else {
            m_view->setBusy(false);
        }
    }

    void MailmanHarvester::slotThreadLoadFinished(bool ok)
    {
        if (!ok) {
            return;
        }
        bool load = m_addressQueue.isEmpty();
        QWebElement doc = m_threadPage->mainFrame()->documentElement();
        foreach (const QWebElement &element, doc.findAll("a")) {
            QString attribute = element.attribute("HREF");
            if (!attribute.isEmpty()) {
                QUrl url = m_threadPage->mainFrame()->url();
                QString urlString = url.toString();
                if (!urlString.endsWith("/")) {
                    urlString += '/';
                }
                urlString += attribute;
                m_addressQueue.enqueue(urlString);
            }
        }
        if (load && !m_addressQueue.isEmpty()) {
            m_view->setBusy(true);
            m_page->mainFrame()->load(QUrl(m_addressQueue.head()));
        }
    }

    #include "mailmanharvester.moc"

D.4 mailmanharvesterview.h

    #ifndef MAILMANHARVESTERVIEW_H
    #define MAILMANHARVESTERVIEW_H

    #include <QWidget>

    #include "ui_mailmanharvesterview_base.h"

    class MailmanHarvesterView : public QWidget, public Ui::mailmanharvesterview_base
    {
        Q_OBJECT
    public:
        MailmanHarvesterView(QWidget *parent);
        virtual ~MailmanHarvesterView();

        void addAddress(const QString &address);
        void setBusy(bool busy);

    private slots:
        void slotExecuteClicked();
        void slotExecuteArchiveClicked();

    signals:
        void execute(QString address);
        void executeArchive(QString address);

    private:
        Ui::mailmanharvesterview_base ui_mailmanharvesterview_base;
    };

    #endif // MAILMANHARVESTERVIEW_H

D.5 mailmanharvesterview.cpp

    #include "mailmanharvesterview.h"

    MailmanHarvesterView::MailmanHarvesterView(QWidget *)
    {
        ui_mailmanharvesterview_base.setupUi(this);
        setAutoFillBackground(true);
        connect(ui_mailmanharvesterview_base.executeButton, SIGNAL(clicked(bool)), this,
                SLOT(slotExecuteClicked()));
        connect(ui_mailmanharvesterview_base.executeArchiveButton, SIGNAL(clicked(bool)), this,
                SLOT(slotExecuteArchiveClicked()));
        ui_mailmanharvesterview_base.dateEdit->setDate(QDate::currentDate());
    }

    MailmanHarvesterView::~MailmanHarvesterView()
    {
    }

    void MailmanHarvesterView::slotExecuteClicked()
    {
        emit execute(ui_mailmanharvesterview_base.klineedit->text());
    }

    void MailmanHarvesterView::addAddress(const QString& address)
    {
        if (ui_mailmanharvesterview_base.resultListWidget->findItems(address, Qt::MatchFixedString).isEmpty()) {
            ui_mailmanharvesterview_base.resultListWidget->addItem(address);
        }
    }

    void MailmanHarvesterView::setBusy(bool busy)
    {
        if (busy) {
            setCursor(Qt::BusyCursor);
        } else {
            ui_mailmanharvesterview_base.resultListWidget->sortItems();
            unsetCursor();
        }
        setEnabled(!busy);
    }

    void MailmanHarvesterView::slotExecuteArchiveClicked()
    {
        QString address = ui_mailmanharvesterview_base.pipermailLineEdit->text();
        if (!address.endsWith("/")) {
            address += "/";
        }
        address += ui_mailmanharvesterview_base.dateEdit->dateTime().toString("yyyy-");
        switch (ui_mailmanharvesterview_base.dateEdit->date().month()) {
        case 1:
            address += "January";
            break;
        case 2:
            address += "February";
            break;
        case 3:
            address += "March";
            break;
        case 4:
            address += "April";
            break;
        case 5:
            address += "May";
            break;
        case 6:
            address += "June";
            break;
        case 7:
            address += "July";
            break;
        case 8:
            address += "August";
            break;
        case 9:
            address += "September";
            break;
        case 10:
            address += "October";
            break;
        case 11:
            address += "November";
            break;
        case 12:
            address += "December";
            break;
        }
        emit executeArchive(address);
    }

    #include "mailmanharvesterview.moc"

D.6 mailmanharvesterviewbase.ui

1 2 3 mailmanharvesterview_base 4 5 6 7 0 8 0 9 590 10 381 11 12 13 14 kapp4_base 15 16 17 true 18 19 20 21 11 22 23 24 6 25 26 27 28 29 URL to email: 30 31 32 klineedit 33 34 35 36 37 38 39 40 41 42 43 44 45 Pipermail: 46 47 48 49 50 51 52 53 54 55 56 57 58 QDateTimeEdit::MonthSection 59 60 61 MMMM yyyy 62 63 64 65 66 67

Figure D.1: Application for extracting addresses from Mailman archives

68 Execute 69 70 71 72 73 74 75 Execute 76 77 78 79 80 81 82 83 KPushButton 84 QPushButton 85

kpushbutton.h
86 87 88 KLineEdit 89 QLineEdit 90
klineedit.h
91
92 93 94 95
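The archive URL that slotExecuteArchiveClicked() in D.5 assembles with a twelve-way switch can also be expressed with a lookup table. The following is only a minimal sketch in standard C++ without Qt; the function name pipermailArchiveUrl and the empty-string error convention are assumptions made for illustration and are not part of the harvester.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical helper (not part of the thesis code): builds the archive URL
// "<base>/<year>-<English month name>" for a month in the range 1..12,
// mirroring the string assembled in slotExecuteArchiveClicked().
std::string pipermailArchiveUrl(std::string base, int year, int month)
{
    static const std::array<const char*, 12> monthNames = {{
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    }};
    if (month < 1 || month > 12) {
        return std::string(); // assumption: an empty string signals an invalid month
    }
    // make sure the base URL ends with exactly the separator the listing adds
    if (base.empty() || base.back() != '/') {
        base += '/';
    }
    return base + std::to_string(year) + "-" + monthNames[month - 1];
}
```

A table keeps the month names in one place and makes the out-of-range case explicit, which the switch statement silently ignores.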

E Automated Scr.im CAPTCHA Solver

E.1 main.cpp

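// NOTE: the name of the first #include directive was lost in extraction;
// <QApplication> is assumed because the class is used below.
#include <QApplication>
#include "ScrimCracker.h"

int main(int argc, char** argv)
{
    QApplication app(argc, argv);
    const QStringList arguments = app.arguments();
    // exactly two arguments are expected: the Scr.im URL and the number of runs
    if (arguments.count() != 3) {
        return 0;
    }
    ScrimCracker cracker(arguments.at(1), arguments.at(2).toInt());
    cracker.show();
    return app.exec();
}

E.2 ScrimCracker.h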

#ifndef SCRIMCRACKER_H
#define SCRIMCRACKER_H

// NOTE: the names of the three original #include directives were lost in
// extraction; the headers below are assumed from the base classes used.
#include <QAbstractTableModel>
#include <QItemDelegate>
#include <QMainWindow>

class QProgressBar;
class QGLShaderProgram;
class QWebPage;
class QGLFramebufferObject;
class QGLWidget;
class QTableView;
class CrackerModelData;

/**
 * @short TableModel containing the data from the runs.
 */
class CrackerModel : public QAbstractTableModel
{
    Q_OBJECT
public:
    enum ColumnNames {
        CaptchaColumn = 0,
        TokenColumn = 1,
        OcrColumn = 2,
        ErrorColumn = 3,
        SuccessColumn = 4,
        CaptchaValue1Column = 5,
        CaptchaValue2Column = 6,
        CaptchaValue3Column = 7,
        CaptchaValue4Column = 8,
        CaptchaValue5Column = 9,
        CaptchaValue6Column = 10,
        CaptchaValue7Column = 11,
        CaptchaValue8Column = 12,
        CaptchaValue9Column = 13,
        ColumnCount = 14
    };
    enum CaptchaStatus {
        CaptchaProcessing = 0,
        CaptchaSuccess = 1,
        CaptchaFailure = 2,
        CaptchaNotTried = 3
    };
    CrackerModel(QObject* parent = 0);
    virtual int columnCount(const QModelIndex& parent = QModelIndex()) const;
    virtual QVariant data(const QModelIndex& index, int role = Qt::DisplayRole) const;
    virtual int rowCount(const QModelIndex& parent = QModelIndex()) const;
    /**
     * Sets the success state for the last entry in the model.
     */
    void setLastSuccess(int numberErrors, CaptchaStatus status);
    /**
     * Sets the CAPTCHA image of the last entry in the model.
     */
    void setLastImage(QImage image);
    /**
     * Sets the OCR result of the last entry in the model.
     */
    void setLastOcr(QString ocr);
    /**
     * Adds a new row for a new run to the model.
     * @param token The internal token as used by Scr.im
     * @param possibleCaptchas The list of possible results as offered by Scr.im
     */
    void addCaptcha(const QString &token, const QStringList &possibleCaptchas);

private:
    QList<CrackerModelData> m_data;
};

class CrackerModelData {
public:
    QPixmap captcha;
    QString token;
    QString ocr;
    QStringList captchaValues;
    int numberErrors;
    CrackerModel::CaptchaStatus status;
};

/**
 * @short Delegate for the table painting the CAPTCHA, everything else uses normal delegate.
 */
class CrackerItemDelegate : public QItemDelegate
{
    Q_OBJECT
public:
    CrackerItemDelegate(QObject* parent = 0);
    virtual void paint(QPainter* painter, const QStyleOptionViewItem& option, const QModelIndex& index) const;
    virtual QSize sizeHint(const QStyleOptionViewItem& option, const QModelIndex& index) const;
};

/**
 * @short The widget handling the runs to break Scr.im CAPTCHAs.
 *
 * This class manages a number of runs to break Scr.im CAPTCHAs. It uses OpenGL to extract the
 * characters from the CAPTCHAs. That's why it has to be a widget and not just a QObject. As it is
 * a widget it also holds a progress bar and a table view with the results of the runs.
 */
class CrackerWidget : public QWidget
{
    Q_OBJECT
public:
    CrackerWidget(QString url, int runs, QWidget* parent = 0, Qt::WindowFlags f = 0);
    ~CrackerWidget();

private slots:
    void loadFinished(bool ok);
    void ocr();

private:
    void paintGl(const QImage &image);
    void nextRun();
    QWebPage *m_page;
    QWebPage *m_ocrResult;
    QString m_imageUrl;
    QString m_url;
    QString m_token;
    QString m_extractedText;
    QStringList m_possibleCaptchas;
    QGLWidget *m_glWidget;
    QGLFramebufferObject *m_renderBuffer;
    QGLShaderProgram *m_shader;
    int m_errorCount;

    QProgressBar *m_progress;
    QTableView *m_table;
    CrackerModel *m_model;
    int m_numberRuns;
};

class ScrimCracker : public QMainWindow
{
    Q_OBJECT
public:
    ScrimCracker(QString url, int runs);
    virtual ~ScrimCracker();
};

#endif // ScrimCracker_H

E.3 ScrimCracker.cpp

#include "ScrimCracker.h"

// NOTE: the names of the nine original #include directives were lost in
// extraction; the headers below are assumed from the classes used in this file.
#include <QDebug>
#include <QGLFramebufferObject>
#include <QGLShaderProgram>
#include <QGLWidget>
#include <QNetworkRequest>
#include <QPainter>
#include <QProcess>
#include <QProgressBar>
#include <QTableView>
#include <QTimer>
#include <QVBoxLayout>
#include <QWebElement>
#include <QWebFrame>
#include <QWebPage>

CrackerModel::CrackerModel(QObject* parent)
    : QAbstractTableModel(parent)
{
}

int CrackerModel::columnCount(const QModelIndex& parent) const
{
    return ColumnCount;
}

int CrackerModel::rowCount(const QModelIndex& parent) const
{
    return m_data.count();
}

QVariant CrackerModel::data(const QModelIndex& index, int role) const
{
    if (role != Qt::DisplayRole) {
        return QVariant();
    }

    // valid rows are 0 .. count()-1
    if (index.row() >= m_data.count()) {
        return QVariant();
    }

    const CrackerModelData &datum = m_data.at(index.row());
    switch (index.column()) {
    case CaptchaColumn:
        return datum.captcha;
    case TokenColumn:
        return datum.token;
    case CaptchaValue1Column:
        return datum.captchaValues[0];
    case CaptchaValue2Column:
        return datum.captchaValues[1];
    case CaptchaValue3Column:
        return datum.captchaValues[2];
    case CaptchaValue4Column:
        return datum.captchaValues[3];
    case CaptchaValue5Column:
        return datum.captchaValues[4];
    case CaptchaValue6Column:
        return datum.captchaValues[5];
    case CaptchaValue7Column:
        return datum.captchaValues[6];
    case CaptchaValue8Column:
        return datum.captchaValues[7];
    case CaptchaValue9Column:
        return datum.captchaValues[8];
    case ErrorColumn:
        if (datum.numberErrors == -1 && datum.status == CaptchaProcessing) {
            // not yet solved
            return QString("currently processed");
        }
        else {
            return QString::number(datum.numberErrors+1);
        }
    case SuccessColumn:
        switch (datum.status) {
        case CaptchaProcessing:
            return QString("currently processed");
        case CaptchaSuccess:
            return QString("Success");
        case CaptchaFailure:
            return QString("Failure");
        case CaptchaNotTried:
            return QString("Untested");
        default:
            return QVariant();
        }
    case OcrColumn:
        if (datum.numberErrors == -1 && datum.status == CaptchaProcessing) {
            // not yet solved
            return QString("currently processed");
        } else {
            return datum.ocr;
        }
    default:
        return QVariant();
    }
}

void CrackerModel::setLastSuccess(int numberErrors, CrackerModel::CaptchaStatus status)
{
    m_data.last().numberErrors = numberErrors;
    m_data.last().status = status;
    QModelIndex changedIndex = index(m_data.count()-1, SuccessColumn);
    emit dataChanged(changedIndex, changedIndex);
}

void CrackerModel::setLastImage(QImage image)
{
    m_data.last().captcha = QPixmap::fromImage(image);
    QModelIndex changedIndex = index(m_data.count()-1, CaptchaColumn);
    emit dataChanged(changedIndex, changedIndex);
}

void CrackerModel::setLastOcr(QString ocr)
{
    m_data.last().ocr = ocr;
    QModelIndex changedIndex = index(m_data.count()-1, OcrColumn);
    emit dataChanged(changedIndex, changedIndex);
}

void CrackerModel::addCaptcha(const QString& token, const QStringList& possibleCaptchas)
{
    beginInsertRows(QModelIndex(), m_data.count(), m_data.count());
    CrackerModelData datum;
    datum.token = token;
    datum.captchaValues = possibleCaptchas;
    datum.numberErrors = -1;
    datum.status = CaptchaProcessing;
    m_data.append(datum);
    endInsertRows();
}

CrackerItemDelegate::CrackerItemDelegate(QObject* parent)
    : QItemDelegate(parent)
{
}

void CrackerItemDelegate::paint(QPainter* painter, const QStyleOptionViewItem& option, const QModelIndex& index) const
{
    if (index.column() != CrackerModel::CaptchaColumn) {
        // all other columns are painted by the default delegate only
        QItemDelegate::paint(painter, option, index);
        return;
    }
    QPixmap pixmap = index.model()->data(index).value<QPixmap>();
    if (pixmap.isNull()) {
        return;
    }
    painter->drawPixmap(QRect(option.rect.topLeft(), pixmap.size()), pixmap, QRect(QPoint(0,0), pixmap.size()));
}

QSize CrackerItemDelegate::sizeHint(const QStyleOptionViewItem& option, const QModelIndex& index) const
{
    if (index.column() != CrackerModel::CaptchaColumn) {
        return QItemDelegate::sizeHint(option, index);
    }
    return QSize(120, 40);
}

CrackerWidget::CrackerWidget(QString url, int runs, QWidget* parent, Qt::WindowFlags f)
    : QWidget(parent, f)
    , m_page(new QWebPage(this))
    , m_ocrResult(new QWebPage(this))
    , m_errorCount(0)
    , m_url(url)
    , m_numberRuns(runs)
{
    m_glWidget = new QGLWidget(QGLFormat(QGL::SampleBuffers|QGL::AlphaChannel), this);
    m_glWidget->hide();
    m_glWidget->makeCurrent();
    m_renderBuffer = new QGLFramebufferObject(120, 40);

    m_shader = new QGLShaderProgram(this);
    m_shader->addShaderFromSourceCode(QGLShader::Vertex,
        "void main()\n"
        "{\n"
        "gl_TexCoord[0] = gl_MultiTexCoord0;\n"
        "gl_Position = ftransform();\n"
        "}\n");
    m_shader->addShaderFromSourceCode(QGLShader::Fragment,
        "uniform sampler2D texture;\n"
        "void main()\n"
        "{\n"
        " vec4 color = texture2D(texture, gl_TexCoord[0].st);\n"
        // red characters are not recognized, so remove red component
        " if (color.r > 0.6)\n"
        " color.r = 0.0;\n"
        // set all bright pixels to white and dark pixels to black - result: the characters
        " if (color.r + color.g + color.b > 1.1)\n"
        " gl_FragColor = vec4(1.0, 1.0, 1.0, 1.0);\n"
        " else\n"
        " gl_FragColor = vec4(0.0, 0.0, 0.0, 1.0);\n"
        "}\n");
    m_shader->link();
    m_shader->bind();
    m_shader->setUniformValue("texture", 0);
    m_shader->release();
    m_glWidget->doneCurrent();

    m_progress = new QProgressBar(this);
    m_progress->setMinimum(0);
    m_progress->setMaximum(0);
    m_table = new QTableView(this);
    m_model = new CrackerModel(this);
    m_table->setModel(m_model);
    m_table->setItemDelegate(new CrackerItemDelegate(this));
    m_table->setColumnWidth(0, 120);
    connect(m_model, SIGNAL(dataChanged(QModelIndex,QModelIndex)), m_table, SLOT(resizeRowsToContents()));
    QVBoxLayout *layout = new QVBoxLayout(this);
    layout->addWidget(m_progress);
    layout->addWidget(m_table);
    setLayout(layout);

    connect(m_page, SIGNAL(loadFinished(bool)), SLOT(loadFinished(bool)));
    nextRun();
}

CrackerWidget::~CrackerWidget()
{
}

void CrackerWidget::nextRun()
{
    if (m_numberRuns > 0) {
        --m_numberRuns;
        m_errorCount = 0;
        m_page->mainFrame()->load(QUrl(m_url));
    } else {
        m_progress->hide();
    }
}

void CrackerWidget::loadFinished(bool ok)
{
    if (!ok) {
        return;
    }
    if (m_page->mainFrame()->url().toString().startsWith("http://scr.im/cap/")) {
        // the frame shows the CAPTCHA image itself: render it and start the OCR
        QImage image(120, 40, QImage::Format_ARGB32);
        image.fill(Qt::transparent);
        QPainter painter(&image);
        QWebElement element = m_page->mainFrame()->findFirstElement("img");
        element.setAttribute("width", "120");
        element.setAttribute("height", "40");
        element.render(&painter);
        painter.end();
        m_model->setLastImage(image);
        paintGl(image);
        QTimer::singleShot(0, this, SLOT(ocr()));
        return;
    }
    // try to find the address
    QWebElement address = m_page->mainFrame()->findFirstElement("div.clearfix p a");
    if (address.hasAttribute("href") && address.attribute("href").startsWith("mailto:")) {
        qDebug() << address.toPlainText();
        m_model->setLastSuccess(m_errorCount, CrackerModel::CaptchaSuccess);
        nextRun();
        return;
    }
    // error page?
    QWebElement error = m_page->mainFrame()->findFirstElement("div.clearfix p a");
    if (error.hasAttribute("href") &&
        error.toPlainText().compare("try again", Qt::CaseInsensitive) == 0) {
        qDebug() << "Incorrect CAPTCHA";
        m_model->setLastSuccess(m_errorCount, CrackerModel::CaptchaFailure);
        nextRun();
        return;
    }

    // extract captcha information
    QWebElementCollection inputElements = m_page->mainFrame()->findAllElements("input");
    foreach (QWebElement element, inputElements) {
        if ((element.attribute("type").compare("hidden", Qt::CaseInsensitive) == 0) &&
            (element.attribute("name")).compare("token", Qt::CaseInsensitive) == 0) {
            m_token = element.attribute("value");
            break;
        }
    }

    QWebElementCollection liElements = m_page->mainFrame()->findAllElements("div.clearfix ul li");
    if (liElements.count() != 9) {
        return;
    }
    m_possibleCaptchas.clear();
    for (int i=0; i<9; i++) {
        m_possibleCaptchas.append(liElements[i].toPlainText());
    }
    m_model->addCaptcha(m_token, m_possibleCaptchas);
    QWebElement img = m_page->mainFrame()->findFirstElement("div.clearfix p img");
    if (img.hasAttribute("src") && img.attribute("src").startsWith("http://scr.im/cap/")) {
        // for some unknown reason the element isn't rendered into the pixmap
        // therefore we just load the image into the frame and render the frame after next load
        m_imageUrl = img.attribute("src");
        m_page->mainFrame()->load(m_imageUrl);
    }
}

void CrackerWidget::paintGl(const QImage& image)
{
    m_glWidget->makeCurrent();
    glPushAttrib(GL_ALL_ATTRIB_BITS);
    m_renderBuffer->bind();
    glClearColor(1.0, 1.0, 1.0, 1.0);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0, image.width(), image.height(), 0, -1, 1);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glViewport(0, 0, image.width(), image.height());

    GLuint texture = m_glWidget->bindTexture(image);
    m_shader->bind();
    m_renderBuffer->drawTexture(QPointF(0.0, 0.0), texture);
    m_shader->release();

    m_renderBuffer->release();
    m_glWidget->deleteTexture(texture);
    // store the filtered image for the OCR run
    QImage result = m_renderBuffer->toImage();
    result.save("/tmp/" + m_token + ".png");
    m_glWidget->doneCurrent();
}

void CrackerWidget::ocr()
{
    const QString path("/tmp/" + m_token + ".png");
    QProcess ocroscript;
    ocroscript.start("ocroscript", QStringList() << "rec-tess" << path );
    if (!ocroscript.waitForStarted()) {
        return;
    }
    if (!ocroscript.waitForFinished()) {
        return;
    }
    QByteArray result = ocroscript.readAllStandardOutput();
    m_ocrResult->mainFrame()->setHtml(QString::fromLatin1(result));
    QWebElement ocrElement = m_ocrResult->mainFrame()->findFirstElement("span.ocr_line");
    m_extractedText = ocrElement.toPlainText();
    m_model->setLastOcr(m_extractedText);
    // the character literal was lost in extraction; a space is assumed here
    m_extractedText.remove(' ');
    QString asUpper = m_extractedText.toUpper();
    QString bestMatch;
    int length = 0;
    // count the position-wise matching characters for each offered solution
    foreach (const QString &test, m_possibleCaptchas) {
        int currentLength = 0;
        for (int i=0; i<5; i++) {
            if ( i == test.size() || i == asUpper.size() ) {
                break;
            }
            if (test.at(i) == asUpper.at(i)) {
                ++currentLength;
            }
        }
        if (currentLength > length) {
            length = currentLength;
            bestMatch = test;
        }
    }
    qDebug() << "Best matching token: " << bestMatch << "with " << length << " matching characters";
    if (length >= 3) {
        // three matching characters - we probably have a match
        QByteArray body;
        body.append("action=view&token=");
        body.append(m_token.toLatin1());
        body.append("&captcha=");
        body.append(bestMatch.toLatin1());
        m_page->mainFrame()->load(QNetworkRequest(m_url),
                                  QNetworkAccessManager::PostOperation,
                                  body);
    } else if (m_errorCount < 5) {
        // try again
        m_page->mainFrame()->load(m_imageUrl);
        ++m_errorCount;
    } else {
        m_model->setLastSuccess(m_errorCount, CrackerModel::CaptchaNotTried);
        nextRun();
    }
}

ScrimCracker::ScrimCracker(QString url, int runs)
{
    CrackerWidget *widget = new CrackerWidget(url, runs, this);
    setCentralWidget(widget);
}

ScrimCracker::~ScrimCracker()
{}

#include "ScrimCracker.moc"

E.4 CMakeLists.txt

project(ScrimCracker)
cmake_minimum_required(VERSION 2.6)
find_package(Qt4 REQUIRED)

include_directories(${QT_INCLUDES} ${CMAKE_CURRENT_BINARY_DIR})

set(ScrimCracker_SRCS ScrimCracker.cpp main.cpp)
qt4_automoc(${ScrimCracker_SRCS})
add_executable(ScrimCracker ${ScrimCracker_SRCS})
target_link_libraries(ScrimCracker
    ${QT_QTCORE_LIBRARY}
    ${QT_QTGUI_LIBRARY}
    ${QT_QTOPENGL_LIBRARY}
    ${QT_QTWEBKIT_LIBRARY})
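The candidate selection performed in CrackerWidget::ocr() (E.3) can be isolated from Qt and OCRopus. The following is a minimal sketch in standard C++ under that assumption; the names Match, bestCandidate and shouldSubmit are hypothetical and only illustrate the heuristic: the OCR text is stripped of spaces, converted to upper case and compared position by position against the first five characters of every offered solution.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the best candidate and its number of matching positions.
struct Match {
    std::string candidate; // empty if no candidate shares a character with the OCR text
    int matching;          // number of positions at which both strings agree
};

// Sketch of the heuristic used in CrackerWidget::ocr(), without Qt.
Match bestCandidate(std::string ocrText, const std::vector<std::string>& candidates)
{
    // normalise the OCR output: remove spaces, convert to upper case
    ocrText.erase(std::remove(ocrText.begin(), ocrText.end(), ' '), ocrText.end());
    std::transform(ocrText.begin(), ocrText.end(), ocrText.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

    Match best{std::string(), 0};
    for (const std::string& candidate : candidates) {
        // compare at most five characters and never read past either string
        const std::size_t limit =
            std::min(std::size_t(5), std::min(candidate.size(), ocrText.size()));
        int matching = 0;
        for (std::size_t i = 0; i < limit; ++i) {
            if (candidate[i] == ocrText[i]) {
                ++matching;
            }
        }
        if (matching > best.matching) {
            best.matching = matching;
            best.candidate = candidate;
        }
    }
    return best;
}

// The listing submits a solution only if at least three characters match.
bool shouldSubmit(const Match& match)
{
    return match.matching >= 3;
}
```

Because only one of nine offered solutions has to be selected, a partially wrong OCR result is still sufficient, which is the reason for the threshold of three matching characters in the listing.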


F Dialog to Solve a Mail-Shake Challenge

F.1 mailshakedialog.h

#ifndef MAILODY_MAILSHAKEDIALOG_H
#define MAILODY_MAILSHAKEDIALOG_H

// NOTE: the names of the two original #include directives were lost in
// extraction; the headers below are assumed from the base classes used.
#include <QWidget>
#include <KDialog>

class QProgressBar;
class QRadioButton;
class QWebPage;
class QWebFrame;
class KTitleWidget;
class KLineEdit;
class KPushButton;

namespace Mailody {

class ImageWidget : public QWidget
{
public:
    ImageWidget(QWidget* parent = 0, Qt::WindowFlags f = 0);

    virtual void paintEvent(QPaintEvent* event);
    virtual QSize sizeHint() const;
    void setImage(const QPixmap& pixmap);

private:
    QPixmap m_pixmap;
};

class MailShakeCaptchaWidget : public QWidget
{
    Q_OBJECT
public:
    MailShakeCaptchaWidget(QWidget* parent = 0, Qt::WindowFlags f = 0);

    /**
     * Extracts the CAPTCHA from the loaded web frame
     * and displays it to the user.
     *
     * If the frame contains the result, the displayed CAPTCHA won't be shown.
     * The email address will be extracted and the signal captchaSolved will be emitted.
     * @param frame The frame containing the loaded CAPTCHA website
     * @returns true if the page was parsed successfully.
     */
    virtual bool parseCaptcha(QWebFrame *frame) = 0;
    /**
     * @returns The data for the post operation
     */
    virtual QByteArray solveCaptcha() = 0;
    virtual QString address() const;
    void showErrorMessage();
    void hideErrorMessage();
    virtual void setLoading(bool loading);

signals:
    void captchaSolved();
    void incorrectCaptcha();

protected:
    KTitleWidget *m_errorMessage;
    QProgressBar *m_progress;
    QString m_address;
};

class ReCAPTCHAWidget : public MailShakeCaptchaWidget
{
    Q_OBJECT
public:
    ReCAPTCHAWidget(QWidget* parent = 0, Qt::WindowFlags f = 0);
    virtual bool parseCaptcha(QWebFrame* frame);
    virtual QByteArray solveCaptcha();
    virtual void setLoading(bool loading);

private:
    KLineEdit *m_lineEdit;
    ImageWidget *m_captchaWidget;
    QString m_captchaParameter;
};

class ScrimWidget : public MailShakeCaptchaWidget
{
    Q_OBJECT
public:
    ScrimWidget(QWidget* parent = 0, Qt::WindowFlags f = 0);
    virtual bool parseCaptcha(QWebFrame* frame);
    virtual QByteArray solveCaptcha();
    virtual void setLoading(bool loading);

private:
    QList<QRadioButton*> m_buttons;
    ImageWidget *m_captchaWidget;
    QString m_token;
};

class MailShakeDialog : public KDialog
{
    Q_OBJECT
public:
    MailShakeDialog(const QUrl& url, QWidget* parent = 0, Qt::WFlags flags = 0);

private slots:
    void loadFinished(bool ok);
    void reloadCaptcha();
    void solve();
    void incorrectCaptcha();

private:
    void setLoading(bool loading);
    MailShakeCaptchaWidget *m_widget;
    QWebPage *m_page;
    QString m_id;
    bool m_loaded;
    QString m_url;
};

}

#endif // MAILODY_MAILSHAKEDIALOG_H

F.2 mailshakedialog.cpp

#include "mailshakedialog.h"
// NOTE: the names of the original Qt and KDE #include directives were lost in
// extraction; the headers below are assumed from the classes used in this file.
// Qt
#include <QLabel>
#include <QNetworkRequest>
#include <QPainter>
#include <QProgressBar>
#include <QRadioButton>
#include <QVBoxLayout>
#include <QWebElement>
#include <QWebFrame>
#include <QWebPage>
// KDE
#include <KIcon>
#include <KLineEdit>
#include <KLocale>
#include <KMessageBox>
#include <KPushButton>
#include <KTitleWidget>

using namespace Mailody;

ImageWidget::ImageWidget(QWidget* parent, Qt::WindowFlags f)
    : QWidget(parent, f)
{
}

void ImageWidget::setImage(const QPixmap& pixmap)
{
    m_pixmap = pixmap;
    setMinimumSize(m_pixmap.size());
    update();
}

QSize ImageWidget::sizeHint() const
{
    // without an image there is nothing to show
    if (m_pixmap.isNull()) {
        return QSize( 0, 0 );
    }
    return m_pixmap.size();
}

void ImageWidget::paintEvent(QPaintEvent* event)
{
    QWidget::paintEvent(event);
    QPainter painter(this);
    if (!m_pixmap.isNull()) {
        // paint the image centered in the widget
        painter.drawPixmap( QPoint(width()/2-m_pixmap.width()/2, height()/2-m_pixmap.height()/2),
                            m_pixmap );
    }
}

/************************************************
 * ScrimWidget
 ************************************************/
ScrimWidget::ScrimWidget(QWidget* parent, Qt::WindowFlags f)
    : MailShakeCaptchaWidget(parent, f)
    , m_captchaWidget(new ImageWidget(this))
{
    layout()->addWidget(new QLabel(i18n("Choose the characters matching those shown in the image.")));
    layout()->addWidget(m_captchaWidget);
    for (int i=0; i<9; ++i) {
        QRadioButton *button = new QRadioButton(this);
        layout()->addWidget(button);
        m_buttons.append(button);
    }
}

bool ScrimWidget::parseCaptcha(QWebFrame* frame)
{
    if (frame->url().toString().startsWith("http://scr.im/cap/")) {
        // the frame shows the CAPTCHA image itself
        QPixmap pixmap(120, 40);
        pixmap.fill(Qt::transparent);
        QPainter painter(&pixmap);
        frame->render(&painter);
        painter.end();
        m_captchaWidget->setImage(pixmap);
        return true;
    }
    // try to find the address
    QWebElement address = frame->findFirstElement("div.clearfix p a");
    if (address.hasAttribute("href") && address.attribute("href").startsWith("mailto:")) {
        m_address = address.toPlainText();
        KMessageBox::information(this, m_address);
        emit captchaSolved();
        return true;
    }
    // error page?
    QWebElement error = frame->findFirstElement("div.clearfix p a");
    if (error.hasAttribute("href") &&
        error.toPlainText().compare("try again", Qt::CaseInsensitive) == 0) {
        emit incorrectCaptcha();
        return true;
    }

    // extract captcha information
    QWebElementCollection inputElements = frame->findAllElements("input");
    foreach (QWebElement element, inputElements) {
        if ((element.attribute("type").compare("hidden", Qt::CaseInsensitive) == 0) &&
            (element.attribute("name")).compare("token", Qt::CaseInsensitive) == 0) {
            m_token = element.attribute("value");
            break;
        }
    }

    QWebElementCollection liElements = frame->findAllElements("div.clearfix ul li");
    if (liElements.count() != 9) {
        return false;
    }
    for (int i=0; i<9; i++) {
        m_buttons[i]->setText(liElements[i].toPlainText());
    }
    QWebElement img = frame->findFirstElement("div.clearfix p img");
    if (img.hasAttribute("src") && img.attribute("src").startsWith("http://scr.im/cap/")) {
        // for some unknown reason the element isn't rendered into the pixmap
        // therefore we just load the image into the frame and render the frame after next load
        frame->load(img.attribute("src"));
        return false;
    }
    return true;
}

QByteArray ScrimWidget::solveCaptcha()
{
    QByteArray body;
    body.append("action=view&token=");
    body.append(m_token.toLatin1());
    body.append("&captcha=");
    QString text;
    foreach (QRadioButton *button, m_buttons) {
        if (button->isChecked()) {
            // strip the keyboard accelerator marker from the button text
            text = button->text().remove('&');
            break;
        }
    }
    body.append(text.toLatin1());
    return body;
}

void ScrimWidget::setLoading(bool loading)
{
    MailShakeCaptchaWidget::setLoading(loading);
    foreach (QRadioButton *button, m_buttons) {
        button->setDisabled(loading);
    }
}


/************************************************
 * ReCAPTCHAWidget
 ************************************************/
ReCAPTCHAWidget::ReCAPTCHAWidget(QWidget* parent, Qt::WindowFlags f)
    : MailShakeCaptchaWidget(parent, f)
    , m_lineEdit(new KLineEdit(this))
    , m_captchaWidget(new ImageWidget(this))
{
    layout()->addWidget(new QLabel(i18n("Type in the two English words shown in the image.")));
    layout()->addWidget(m_captchaWidget);
    layout()->addWidget(m_lineEdit);
}

bool ReCAPTCHAWidget::parseCaptcha(QWebFrame* frame)
{
    QWebElement img = frame->findFirstElement("td.recaptcha_image_cell div img");
    if (img.hasAttribute("src") && img.attribute("src").startsWith("http://api.recaptcha.net/image")) {
        const int width = img.attribute("width").toInt();
        const int height = img.attribute("height").toInt();
        QPixmap pixmap(width, height);
        pixmap.fill(Qt::transparent);
        QPainter painter(&pixmap);
        img.render(&painter);
        m_captchaWidget->setImage(pixmap);
    }

    QWebElement hidden = frame->findFirstElement("div.recaptcha_input_area span input");
    if (hidden.attribute("id").startsWith("recaptcha_challenge_field")) {
        m_captchaParameter = hidden.attribute("value");
        return true;
    }

    // test if we have the address
    QWebElement address = frame->findFirstElement("div center b a");
    if (address.hasAttribute("href") && address.attribute("href").startsWith("mailto:")) {
        m_address = address.toPlainText();
        KMessageBox::information(this, m_address);
        emit captchaSolved();
        return true;
    }
    return false;
}

QByteArray ReCAPTCHAWidget::solveCaptcha()
{
    QByteArray body;
    body.append("recaptcha_challenge_field=");
    body.append(m_captchaParameter.toLatin1());
    body.append("&recaptcha_response_field=");
    body.append(m_lineEdit->text().toLatin1());
    return body;
}

void ReCAPTCHAWidget::setLoading(bool loading)
{
    MailShakeCaptchaWidget::setLoading(loading);
    m_lineEdit->setDisabled(loading);
}

/************************************************
 * MailShakeCaptchaWidget
 ************************************************/
MailShakeCaptchaWidget::MailShakeCaptchaWidget(QWidget* parent, Qt::WindowFlags f)
    : QWidget(parent, f)
    , m_errorMessage(new KTitleWidget(this))
    , m_progress(new QProgressBar(this))
{
    m_errorMessage->setText(i18n("Your result was incorrect. Please try again"),
                            KTitleWidget::ErrorMessage);
    m_progress->hide();
    m_errorMessage->hide();
    QVBoxLayout *layout = new QVBoxLayout(this);
    layout->addWidget(m_progress);
    layout->addWidget(m_errorMessage);
    setLayout(layout);
}

QString MailShakeCaptchaWidget::address() const
{
    return m_address;
}

void MailShakeCaptchaWidget::showErrorMessage()
{
    m_errorMessage->show();
}

void MailShakeCaptchaWidget::hideErrorMessage()
{
    m_errorMessage->hide();
}

void MailShakeCaptchaWidget::setLoading(bool loading)
{
    if (loading) {
        // minimum == maximum == 0 shows a busy indicator
        m_progress->setMinimum(0);
        m_progress->setMaximum(0);
        m_progress->reset();
        m_progress->show();
    } else {
        m_progress->hide();
    }
}

/************************************************
 * MailShakeDialog
 ************************************************/
MailShakeDialog::MailShakeDialog(const QUrl& url, QWidget* parent, Qt::WFlags flags)
    : KDialog(parent, flags)
    , m_loaded(false)
{
    // remove mailshake:
    const QString mailShakeUrl = url.toString().mid(10);
    const int index = mailShakeUrl.indexOf('?');
    m_id = mailShakeUrl.left(index);

    setButtons( User1 | User2 | Cancel );

    KPushButton *solveButton = button(User1);
    solveButton->setText(i18n("Solve"));
    solveButton->setIcon(KIcon("dialog-ok"));

    KPushButton *reloadButton = button(User2);
    reloadButton->setText(i18n("Reload Captcha"));
    reloadButton->setIcon(KIcon("view-refresh"));

    m_url = mailShakeUrl.mid(index+1);
    // NOTE: m_widget is only set for the two supported CAPTCHA services
    if (m_url.startsWith("http://mailhide.recaptcha.net/")) {
        m_widget = new ReCAPTCHAWidget(this);
    } else if (m_url.startsWith("http://scr.im/")) {
        m_widget = new ScrimWidget(this);
    }
    setMainWidget(m_widget);
    m_page = new QWebPage(this);
    connect(m_page, SIGNAL(loadFinished(bool)), SLOT(loadFinished(bool)));
    connect(this, SIGNAL(user1Clicked()), SLOT(solve()));
    connect(this, SIGNAL(user2Clicked()), SLOT(reloadCaptcha()));
    connect(m_widget, SIGNAL(captchaSolved()), SLOT(accept()));
    connect(m_widget, SIGNAL(incorrectCaptcha()), SLOT(incorrectCaptcha()));
    m_widget->setLoading(true);
    enableButton(User1, false);
    m_page->mainFrame()->load(m_url);
}

void MailShakeDialog::loadFinished(bool ok)
{
    if (!ok) {
        return;
    }
    bool success = m_widget->parseCaptcha(m_page->mainFrame());
    if (m_loaded && success) {
        m_widget->showErrorMessage();
    } else {
        m_widget->hideErrorMessage();
    }
    if (success) {
        m_loaded = true;
        setLoading(false);
    }
}

void MailShakeDialog::solve()
{
    m_page->mainFrame()->load(QNetworkRequest(m_url),
                              QNetworkAccessManager::PostOperation,
                              m_widget->solveCaptcha());
    setLoading(true);
}

void MailShakeDialog::reloadCaptcha()
{
    m_loaded = false;
    m_widget->hideErrorMessage();
    setLoading(true);
    m_page->mainFrame()->load(m_url);
}

void MailShakeDialog::setLoading(bool loading)
{
    m_widget->setLoading(loading);
    enableButton(User1, !loading);
}

void MailShakeDialog::incorrectCaptcha()
{
    m_loaded = false;
    m_widget->showErrorMessage();
    setLoading(true);
    m_page->mainFrame()->load(m_url);
}
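The constructor of MailShakeDialog (F.2) splits the challenge URL by fixed offsets: the first ten characters ("mailshake:") are removed, the text before the first '?' becomes the ID and the remainder is the CAPTCHA URL. The following is a minimal sketch of the same split in standard C++ without Qt; the names MailShakeChallenge and parseMailShakeUrl are hypothetical, and the explicit validation is an addition that the listing does not perform.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for the sketch below.
struct MailShakeChallenge {
    std::string id;         // text between "mailshake:" and the first '?'
    std::string captchaUrl; // everything after the first '?'
    bool valid;             // false if the input does not have the expected form
};

// Splits "mailshake:<id>?<captcha url>" as done in the MailShakeDialog
// constructor, but rejects input without the prefix or without a '?'.
MailShakeChallenge parseMailShakeUrl(const std::string& url)
{
    static const std::string scheme = "mailshake:";
    MailShakeChallenge result{std::string(), std::string(), false};
    if (url.compare(0, scheme.size(), scheme) != 0) {
        return result; // prefix missing
    }
    const std::string rest = url.substr(scheme.size());
    const std::string::size_type separator = rest.find('?');
    if (separator == std::string::npos) {
        return result; // no CAPTCHA URL present
    }
    result.id = rest.substr(0, separator);
    result.captchaUrl = rest.substr(separator + 1);
    result.valid = true;
    return result;
}
```

With such a check the dialog could refuse malformed links before a CAPTCHA widget is selected, instead of relying on the two startsWith() tests alone.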

G RSS Generator

G.1 main.cpp

#include "rssgenerator.h"
// NOTE: the names of the four remaining #include directives were lost in
// extraction; the headers below are assumed from the classes used in this file.
#include <QCoreApplication>
#include <QStringList>
#include <iostream>

int main(int argc, char** argv)
{
    QCoreApplication app(argc, argv);
    const QStringList arguments = app.arguments();
    if (arguments.contains("--help")) {
        // show the help list
        std::cout << "Spam Templates RSS generator" << std::endl;
        std::cout << "Usage: rssgenerator [OPTION]" << std::endl << std::endl;
        std::cout << "--help Show this help text" << std::endl;
        std::cout << "--dir= Generate RSS for all files in specified directory" << std::endl;
        std::cout << "--templatefile= Generate RSS for specified template file" << std::endl;
        std::cout << "--outputfile= RSS file, default templates.rss" << std::endl;
        std::cout << "--rss= Append to the RSS file instead of generating a new one" << std::endl;
        return 0;
    }
    QString dir;
    QStringList templatefiles;
    QString outputfile("template.rss");
    QString rss;
    // every option has the form --name=value; --templatefile may be given several times
    foreach (const QString &argument, arguments) {
        if (argument.startsWith("--dir=")) {
            dir = argument.mid(6);
            continue;
        }
        if (argument.startsWith("--templatefile=")) {
            templatefiles.append(argument.mid(15));
            continue;
        }
        if (argument.startsWith("--outputfile=")) {
            outputfile = argument.mid(13);
            continue;
        }
        if (argument.startsWith("--rss=")) {
            rss = argument.mid(6);
            continue;
        }
    }

    SpamTemplates::RSSGenerator generator(dir, templatefiles, outputfile, rss);
    QObject::connect(&generator, SIGNAL(finished()), &app, SLOT(quit()));
    return app.exec();
}

G.2 rssgenerator.h

#ifndef SPAMTEMPLATS_RSSGENERATOR_H
#define SPAMTEMPLATS_RSSGENERATOR_H

// NOTE: the names of the three original #include directives were lost in
// extraction; the headers below are assumed from the types used in this class.
#include <QObject>
#include <QStringList>
#include <QDomDocument>

namespace SpamTemplates
{

class RSSGenerator
    : public QObject
{
    Q_OBJECT
public:
    RSSGenerator(QString dir, QStringList templateFiles, QString outputFile, QString rss);
    ~RSSGenerator();

public slots:
    void startProcessing();

signals:
    void finished();

private:
    void generateRSS();
    void readRSS();
    bool readTemplate(const QString &path);
    void error(const char* message);
    bool containsGUID(const QString &guid) const;
    QString date();

private:
    QString m_dir;
    QStringList m_templateFiles;
    QString m_outputFile;
    QString m_rss;
    QDomDocument m_rssDoc;
};

} // namespace SpamTemplates

#endif // SPAMTEMPLATS_RSSGENERATOR_H

G.3 rssgenerator.cpp

#include "rssgenerator.h"

#include <QDir>
#include <QFile>
#include <QHash>
#include <QTextStream>
#include <QTimer>
#include <QXmlStreamWriter>

namespace SpamTemplates
{

RSSGenerator::RSSGenerator(QString dir, QStringList templateFiles, QString outputFile, QString rss)
    : QObject()
    , m_dir(dir)
    , m_templateFiles(templateFiles)
    , m_outputFile(outputFile)
    , m_rss(rss)
    , m_rssDoc("rss")
{
    // defer processing until the event loop has been started
    QTimer::singleShot(0, this, SLOT(startProcessing()));
}

RSSGenerator::~RSSGenerator()
{
}

void RSSGenerator::startProcessing()
{
    if (m_rss.isEmpty()) {
        // generate the RSS
        generateRSS();
    } else {
        readRSS();
    }
    if (!m_dir.isEmpty()) {
        // append all files in the directory to the template files
        QDir dir(m_dir);
        if (!dir.exists()) {
            error("specified directory does not exist");
            return;
        }
        foreach (const QString &fileName, dir.entryList(QDir::Files | QDir::Readable)) {
            m_templateFiles.append(dir.absoluteFilePath(fileName));
        }
    }
    foreach (const QString &fileName, m_templateFiles) {
        if (!readTemplate(fileName)) {
            return;
        }
    }
    // write back the document
    QFile outputFile(m_outputFile);
    if (!outputFile.open(QIODevice::WriteOnly)) {
        error("Could not open output file for writing");
        return;
    }
    QTextStream out(&outputFile);
    m_rssDoc.save(out, 4);
    outputFile.close();
    emit finished();
}

void RSSGenerator::generateRSS()
{
    QString xml;
    QXmlStreamWriter stream(&xml);
    stream.setAutoFormatting(true);
    stream.writeStartDocument();

    stream.writeStartElement("rss");
    stream.writeAttribute("version", "2.0");
    stream.writeStartElement("channel");
    stream.writeTextElement("title", "Spam Templates");
    stream.writeTextElement("link", "http://pi1.informatik.uni-mannheim.de");
    stream.writeTextElement("description", "RSS feed containing the Spam Templates for Spam filtering");
    // FIXME: not the correct time format
    stream.writeTextElement("lastBuildDate", date());
    stream.writeEndElement(); // channel
    stream.writeEndElement(); // rss

    stream.writeEndDocument();
    if (!m_rssDoc.setContent(xml)) {
        // error occurred
        error("could not generate the RSS generic file");
    }
}

void RSSGenerator::readRSS()
{
    QFile file(m_rss);
    if (!file.exists()) {
        error("RSS file does not exist");
        return;
    }
    if (!file.open(QIODevice::ReadOnly)) {
        error("RSS file could not be opened");
        return;
    }
    if (!m_rssDoc.setContent(&file)) {
        file.close();
        error("Could not parse RSS file");
        // do not continue with an unparsed document
        return;
    }
    file.close();
    // exactly one lastBuildDate element is expected; update it to the current date
    QDomNodeList elements = m_rssDoc.documentElement().elementsByTagName("lastBuildDate");
    if (elements.isEmpty() || elements.count() > 1) {
        error("Malformed RSS file");
        return;
    }
    elements.at(0).toElement().firstChild().toText().setData(date());
}

bool RSSGenerator::readTemplate(const QString& path)
{
    QFile file(path);
    if (!file.exists()) {
        error("File does not exist");
        return false;
    }
    if (!file.open(QIODevice::ReadOnly)) {
        error("Could not open file");
        return false;
    }
    const QByteArray content = file.readAll();
    file.close();
    // the hash of the file content serves as title and GUID of the item
    const QString hash = QString::number(qHash(content));
    if (containsGUID(hash)) {
        // template is already part of the feed
        return true;
    }
    QString xml;
    QXmlStreamWriter stream(&xml);
    stream.setAutoFormatting(true);
    stream.writeStartElement("item");
    stream.writeTextElement("title", hash);
    stream.writeTextElement("link", "http://pi1.uni-mannheim.de/bla");
    stream.writeTextElement("description", content);
    stream.writeTextElement("guid", hash);
    stream.writeEndElement(); // item

    QDomDocument part;
    if (!part.setContent(xml)) {
        error("Could not generate item content");
        return false;
    }
    QDomNodeList elements = m_rssDoc.documentElement().elementsByTagName("channel");
    if (elements.isEmpty() || elements.count() > 1) {
        error("Malformed RSS file");
        return false;
    }
    elements.at(0).appendChild(part.documentElement());
    return true;
}

bool RSSGenerator::containsGUID(const QString& guid) const
{
    QDomNodeList elements = m_rssDoc.documentElement().elementsByTagName("guid");
    for (int i=0; i

G.4 CMakeLists.txt

project(rssgenerator)
cmake_minimum_required(VERSION 2.6)
find_package(Qt4 REQUIRED)

include_directories(${QT_INCLUDES} ${CMAKE_CURRENT_BINARY_DIR})

set(rssgenerator_SRCS rssgenerator.cpp main.cpp)
qt4_automoc(${rssgenerator_SRCS})
add_executable(rssgenerator ${rssgenerator_SRCS})
target_link_libraries(rssgenerator ${QT_QTCORE_LIBRARY} ${QT_QTXML_LIBRARY})
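The FIXME in generateRSS() (G.3) notes that lastBuildDate is not written in the correct time format. RSS 2.0 expects dates that conform to RFC 822, preferably with a four-digit year. The following is a minimal sketch of such a formatter in standard C++ rather than Qt, so that it is self-contained. The function name rfc822Date is hypothetical and not part of the generator, and the sketch assumes the default "C" locale so that day and month names are English.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper, not part of the RSS generator listing.
// Formats a UTC timestamp as an RFC 822 date with a four-digit year.
// Assumes the default "C" locale, so %a and %b yield English abbreviations.
std::string rfc822Date(std::time_t timestamp)
{
    const std::tm *utc = std::gmtime(&timestamp);
    assert(utc != nullptr);
    char buffer[64];
    const std::size_t written =
        std::strftime(buffer, sizeof(buffer), "%a, %d %b %Y %H:%M:%S GMT", utc);
    return std::string(buffer, written);
}
```

Calling rfc822Date(std::time(nullptr)) would yield a string suitable for the lastBuildDate element. Inside the generator itself the result would still have to be converted to a QString.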

H Spam Templates Library

H.1 template.h

#ifndef SPAMTEMPLATES_TEMPLATE_H
#define SPAMTEMPLATES_TEMPLATE_H
#include <list>
#include <string>

namespace SpamTemplates
{
// forward-declarations
class Mail;
class TemplatePrivate;
class TemplateManager;

/**
 * @short Represents one Spam Template to match emails against.
 *
 * This class encapsulates one Spam Template. It consists of regular expressions to check the
 * subject, the X-Mailer header and the body. The body consists of a list of regular expressions,
 * each representing one line of the body.
 *
 * An email can be checked against the Template and a score will be determined. The algorithm tries
 * to match as many lines of the body as possible in one go. The score depends on the number of
 * matching lines and on the number of matching lines in one sequence compared to the number of
 * lines of the mail as well as of the template. Additionally the number of empty lines is compared.
 * The score allows the caller to decide whether the mail is spam or not.
 *
 * @author Martin Graesslin
 */
class Template
{
public:
    /**
     * Ctor taking the TemplateManager, which provides the fuzziness factor and the weights
     * of the score, and the number of empty lines.
     */
    Template(const TemplateManager &manager, int emptyLines = 0);
    /**
     * Ctor allowing to set all fields of the Spam Template.
     */
    Template(const std::string &subject, const std::string &mailer,
             const std::list<std::string> &body, const TemplateManager &manager, int emptyLines = 0);
    ~Template();

    /**
     * Sets the number of empty lines which is another factor for calculating the score.
     */
    void setEmptyLines(int count);

    /**
     * Sets the regular expression to be used to check a mail subject.
     */
    void setSubject(const std::string &subject);
    /**
     * Sets the regular expression to be used to check the X-Mailer header.
     */
    void setMailer(const std::string &mailer);
    /**
     * Sets the list of regular expressions to check the body of a mail. Each entry of the list
     * represents one line of the body.
     */
    void setBody(const std::list<std::string> &body);

    /**
     * Checks the given mail against this Spam Template. It is the task of the caller to
     * interpret the returned score.
     * @param mail The email which should be checked
     * @returns The score indicating the likelihood of spam.
     */
    int checkMail(const Mail &mail) const;

    std::string subject() const;
    std::string mailer() const;
    std::list<std::string> body() const;

private:
    // private and without effect: assignment is disabled
    Template &operator=(const Template& t);
    TemplatePrivate* const d;
};

} // namespace SpamTemplates

#endif // SPAMTEMPLATES_TEMPLATE_H

H.2 template.cpp

#include "template.h"
#include "templatemanager.h"
#include "mail.h"

#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <stdexcept>

template <typename T> T max(T x, T y) {
    if ( x < y ) {
        return y;
    } else {
        return x;
    }
}

namespace SpamTemplates
{

//**************************************************************
// TemplatePrivate
//*************************************************************/
class TemplatePrivate {
public:
    TemplatePrivate(const TemplateManager &m, int empty)
        : emptyLines(empty)
        , manager(m) {
    }
    TemplatePrivate(const std::string &s, const std::string &m, const TemplateManager &tm, int empty)
        : subject(s)
        , mailer(m)
        , emptyLines(empty)
        , manager(tm) {
    }
    boost::regex subject;
    boost::regex mailer;
    std::list<boost::regex> body;
    int emptyLines;
    const TemplateManager &manager;
};

//**************************************************************
// Template
//*************************************************************/
Template::Template(const TemplateManager &manager, int emptyLines)
    : d(new TemplatePrivate(manager, emptyLines))
{
}

Template::Template(const std::string &subject, const std::string &mailer,
                   const std::list<std::string> &body, const TemplateManager &manager, int emptyLines)
    : d(new TemplatePrivate(subject, mailer, manager, emptyLines))
{
    setBody(body);
}

Template::~Template()
{
    delete d;
}

Template& Template::operator=(const SpamTemplates::Template& /*t*/ )
{
    // copy disabled
    return *this;
}

void Template::setEmptyLines(int count)
{
    d->emptyLines = count;
}

void Template::setBody(const std::list<std::string> &body)
{
    BOOST_FOREACH(const std::string &line, body) {
        d->body.push_back(boost::regex(line));
    }
}

void Template::setMailer(const std::string &mailer)
{
    d->mailer = boost::regex(mailer);
}

void Template::setSubject(const std::string &subject)
{
    d->subject = boost::regex(subject);
}

std::list< std::string > Template::body() const
{
    std::list<std::string> ret;
    BOOST_FOREACH (const boost::regex &regex, d->body) {
        ret.push_back(regex.str());
    }
    return ret;
}

std::string Template::mailer() const
{
    return d->mailer.str();
}

std::string Template::subject() const
{
    return d->subject.str();
}

int Template::checkMail(const Mail &mail) const
{
    int score = 0;
    try {
        if (!d->subject.empty() && boost::regex_match(mail.subject(), d->subject)) {
            // subject matching
            score += d->manager.subjectQuantification();
        }
    } catch (const std::runtime_error &) {
        // have to catch exception thrown by regex_match
    }
    try {
        if (!d->mailer.empty() && boost::regex_match(mail.mailer(), d->mailer)) {
            // mailer is matching
            score += d->manager.mailerQuantification();
        }
    } catch (const std::runtime_error &) {
        // have to catch exception thrown by regex_match
    }
    if (d->body.empty()) {
        // no body expressions to match against
        return score;
    }
    int lineCount = 0;
    int matchCount = 0;
    // the maximum number of lines which were matched in one go
    int maxMatches = 0;
    int currentMaxMatches = 0;
    // the fuzziness is a percentage of the template length
    const int lineSeek = (float)d->manager.fuzziness()/100.0*d->body.size();
    int emptyLineCounter = 0;
    // try to match as many lines as possible in one go
    std::list<boost::regex>::const_iterator it = d->body.begin();
    BOOST_FOREACH (const std::string &line, mail.body()) {
        if (line.empty()) {
            // skip empty lines
            ++emptyLineCounter;
            continue;
        }
        try {
            ++lineCount;
            // does the current iterator match the line?
            if (boost::regex_match(line, *it)) {
                // matches
                ++matchCount;
                ++currentMaxMatches;
                maxMatches = max(maxMatches, currentMaxMatches);
                ++it;
                if (it == d->body.end()) {
                    // back to begin of template
                    it = d->body.begin();
                    currentMaxMatches = 0;
                }
                continue;
            } else if (d->manager.fuzziness() > 0) {
                // doesn't match - try forward seek over at most lineSeek template lines
                std::list<boost::regex>::const_iterator tit = it;
                ++tit;
                bool foundMatch = false;
                for (int i = 0; i < lineSeek && tit != d->body.end(); ++i, ++tit) {
                    if (boost::regex_match(line, *tit)) {
                        // matches
                        foundMatch = true;
                        break;
                    }
                }
                if (!foundMatch) {
                    // did not find a match in forward seek - try backward seek
                    tit = it;
                    for (int i = 0; i < lineSeek && tit != d->body.begin(); ++i, --tit) {
                        if (boost::regex_match(line, *tit)) {
                            // matches
                            foundMatch = true;
                            break;
                        }
                    }
                }
                if (foundMatch) {
                    ++matchCount;
                    ++currentMaxMatches;
                    maxMatches = max(maxMatches, currentMaxMatches);
                    ++it;
                    if (it == d->body.end()) {
                        // back to begin of template
                        it = d->body.begin();
                        currentMaxMatches = 0;
                    }
                    continue;
                }
            }
            // so far no match - reset to beginning
            currentMaxMatches = 0;
            it = d->body.begin();
            while (it != d->body.end()) {
                if (boost::regex_match(line, *it)) {
                    ++matchCount;
                    ++currentMaxMatches;
                    maxMatches = max(maxMatches, currentMaxMatches);
                    ++it;
                    if (it == d->body.end()) {
                        // back to begin of template
                        it = d->body.begin();
                        currentMaxMatches = 0;
                    }
                    break;
                }
                ++it;
            }
            if (it == d->body.end()) {
                // no line matched - reset
                it = d->body.begin();
            }
        } catch (const std::runtime_error &) {
            // have to catch exception thrown by regex_match
            ++it;
            currentMaxMatches = 0;
            if (it == d->body.end()) {
                // back to begin of template
                it = d->body.begin();
            }
        }
    }
    if (lineCount) {
        // each ratio is weighted with the corresponding quantification of the manager
        score += float(matchCount)/float(lineCount)*d->manager.linesQuantification();
        score += float(maxMatches)/float(lineCount)*d->manager.sequenceMailQuantification();
        score += float(maxMatches)/float(d->body.size())*d->manager.sequencetemplateQuantification();
        float emptyLinesFactor = 1.0;
        if (d->emptyLines > 0) {
            emptyLinesFactor = (float)emptyLineCounter/(float)d->emptyLines;
        }
        if ((d->emptyLines == 0 || emptyLinesFactor > 1.0) && emptyLineCounter > 0) {
            // swap
            emptyLinesFactor = (float)d->emptyLines/(float)emptyLineCounter;
        }
        score += emptyLinesFactor*d->manager.emptyLinesQuantification();
    }
    return score;
}

} // namespace SpamTemplates

H.3 templatemanager.h

#ifndef SPAMTEMPLATES_TEMPLATEMANAGER_H
#define SPAMTEMPLATES_TEMPLATEMANAGER_H

namespace SpamTemplates {

class TemplateManagerPrivate;

class TemplateManager
{
public:
    TemplateManager();
    ~TemplateManager();

    void setSubjectQuantification(int quantification);
    void setMailerQuantification(int quantification);
    void setLinesQuantification(int quantification);
    void setSequenceMailQuantification(int quantification);
    void setSequenceTemplateQuantification(int quantification);
    void setEmptyLinesQuantification(int quantification);
    void setFuzziness(int fuzziness);

    int subjectQuantification() const;
    int mailerQuantification() const;
    int linesQuantification() const;
    int sequenceMailQuantification() const;
    int sequencetemplateQuantification() const;
    int emptyLinesQuantification() const;
    int fuzziness() const;

private:
    TemplateManagerPrivate* const d;
};

}

#endif // SPAMTEMPLATES_TEMPLATEMANAGER_H

H.4 templatemanager.cpp

#include "templatemanager.h"

namespace SpamTemplates
{
/************************************************
 * TemplateManagerPrivate
 ************************************************/
class TemplateManagerPrivate {
public:
    TemplateManagerPrivate()
        : subjectQuantification(0)
        , mailerQuantification(0)
        , linesQuantification(0)
        , sequenceMailQuantification(0)
        , sequenceTemplateQuantification(0)
        , emptyLinesQuantification(0)
        , fuzziness(0)
    {
    }
    int subjectQuantification;
    int mailerQuantification;
    int linesQuantification;
    int sequenceMailQuantification;
    int sequenceTemplateQuantification;
    int emptyLinesQuantification;
    int fuzziness;
};

/************************************************
 * TemplateManager
 ************************************************/

TemplateManager::TemplateManager()
    : d(new TemplateManagerPrivate())
{
}

TemplateManager::~TemplateManager()
{
    delete d;
}

int TemplateManager::emptyLinesQuantification() const
{
    return d->emptyLinesQuantification;
}

int TemplateManager::linesQuantification() const
{
    return d->linesQuantification;
}

int TemplateManager::mailerQuantification() const
{
    return d->mailerQuantification;
}

int TemplateManager::sequenceMailQuantification() const
{
    return d->sequenceMailQuantification;
}

int TemplateManager::sequencetemplateQuantification() const
{
    return d->sequenceTemplateQuantification;
}

int TemplateManager::subjectQuantification() const
{
    return d->subjectQuantification;
}

int TemplateManager::fuzziness() const
{
    return d->fuzziness;
}

void TemplateManager::setEmptyLinesQuantification(int quantification)
{
    d->emptyLinesQuantification = quantification;
}

void TemplateManager::setLinesQuantification(int quantification)
{
    d->linesQuantification = quantification;
}

void TemplateManager::setMailerQuantification(int quantification)
{
    d->mailerQuantification = quantification;
}

void TemplateManager::setSequenceMailQuantification(int quantification)
{
    d->sequenceMailQuantification = quantification;
}

void TemplateManager::setSequenceTemplateQuantification(int quantification)
{
    d->sequenceTemplateQuantification = quantification;
}

void TemplateManager::setSubjectQuantification(int quantification)
{
    d->subjectQuantification = quantification;
}

void TemplateManager::setFuzziness(int fuzziness)
{
    d->fuzziness = fuzziness;
}

} // namespace

H.5 mail.h

#ifndef SPAMTEMPLATES_MAIL_H
#define SPAMTEMPLATES_MAIL_H
#include <list>
#include <string>

namespace SpamTemplates
{

class Mail
{
public:
    Mail();
    Mail(const std::string &subject, const std::string &mailer, const std::list<std::string> &body);

    void setSubject(const std::string &subject);
    void setMailer(const std::string &mailer);
    void setBody(const std::list<std::string> &body);
    const std::string &subject() const;
    const std::string &mailer() const;
    const std::list<std::string> &body() const;

private:
    // TODO: d-pointer
    std::string m_subject;
    std::string m_mailer;
    std::list<std::string> m_body;
};

} // namespace SpamTemplates

#endif // SPAMTEMPLATES_MAIL_H

H.6 mail.cpp

#include "mail.h"

namespace SpamTemplates
{

Mail::Mail()
{
}

Mail::Mail(const std::string &subject, const std::string &mailer, const std::list<std::string> &body)
    : m_subject(subject)
    , m_mailer(mailer)
    , m_body(body)
{
}

void Mail::setBody(const std::list<std::string> &body)
{
    m_body = body;
}

void Mail::setMailer(const std::string &mailer)
{
    m_mailer = mailer;
}

void Mail::setSubject(const std::string &subject)
{
    m_subject = subject;
}


const std::list<std::string> &Mail::body() const
{
    return m_body;
}

const std::string &Mail::mailer() const
{
    return m_mailer;
}

const std::string &Mail::subject() const
{
    return m_subject;
}

} // namespace SpamTemplates
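The body-related part of the score computed by Template::checkMail() (H.2) can be illustrated in isolation. The following is a simplified sketch under stated assumptions, not the library code. It uses std::regex in place of boost::regex and omits the fuzziness seek, the subject and X-Mailer checks and the empty-line factor. The names Weights and bodyScore are hypothetical, and the three weights stand in for the lines, sequenceMail and sequenceTemplate quantifications of the TemplateManager.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for three of the TemplateManager quantifications.
struct Weights {
    int lines;
    int sequenceMail;
    int sequenceTemplate;
};

// Simplified body score: strict in-order matching, no fuzziness seek.
int bodyScore(const std::vector<std::regex> &tmpl,
              const std::vector<std::string> &body,
              const Weights &w)
{
    assert(w.lines >= 0 && w.sequenceMail >= 0 && w.sequenceTemplate >= 0);
    if (tmpl.empty()) {
        return 0;
    }
    int lineCount = 0;   // non-empty lines of the mail
    int matchCount = 0;  // lines matched by any template line
    int run = 0;         // length of the current matching sequence
    int maxRun = 0;      // longest matching sequence seen so far
    std::size_t pos = 0; // template line expected next
    for (const std::string &line : body) {
        if (line.empty()) {
            continue; // empty lines do not count as mail lines
        }
        ++lineCount;
        bool matched = std::regex_match(line, tmpl[pos]);
        if (!matched) {
            // sequence broken: search the template from its beginning
            run = 0;
            for (pos = 0; pos < tmpl.size(); ++pos) {
                if (std::regex_match(line, tmpl[pos])) {
                    matched = true;
                    break;
                }
            }
        }
        if (matched) {
            ++matchCount;
            ++run;
            maxRun = std::max(maxRun, run);
            if (++pos == tmpl.size()) {
                // end of template reached: start over
                pos = 0;
                run = 0;
            }
        } else {
            pos = 0;
        }
    }
    if (lineCount == 0) {
        return 0;
    }
    // each weighted ratio is truncated to an integer before it is added
    int score = 0;
    score += static_cast<int>(float(matchCount) / float(lineCount) * w.lines);
    score += static_cast<int>(float(maxRun) / float(lineCount) * w.sequenceMail);
    score += static_cast<int>(float(maxRun) / float(tmpl.size()) * w.sequenceTemplate);
    return score;
}
```

With weights that sum to 100, a mail whose non-empty lines match the template completely and in order receives the full score, while additional unmatched lines lower the first two ratios. The threshold above which a mail counts as spam is left to the caller, as in the library.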