Analyzing Tor abuse complaints

Rachee Singh Sadia Afroz University of Massachusetts Amherst UC Berkeley, ICSI

1. INTRODUCTION. Cluster Topic Top words We want to understand the characteristics of the network abuse 0 Mixed graphicsoneinc, junkemailfilter, asthma, pascal, originating from Tor. Understanding the abuse of Tor is impor- bob, die, suffers, helo, population, sun tant because website operators and content providers discriminate 1 Mixed password, failed, sshd, whois, timezone, den- against Tor users [3], in many cases, because of the abuse origi- mark, investigate, extracted, invalid, apologies nating from Tor. Akamai observed that an HTTP request from a 2 Googlegroups groups, rubin, googlegroups, broadcast, jimenez, Tor IP is 30 times more likely to be a malicious attack than one jew, steve, posting, satanic, conspiracy that comes from a non-Tor IP [1]. Cloudflare found that 94% of 3 Copyright paramount, copyright, irdeto, cert, infringement, requests from the Tor network are malicious [2]. Abuse control is notice, pgp, compliance, voxility, material crucial for the Tor network itself to maximize the benefit of anony- 4 Sent by nforce mnt, logs, furanet, htdocs, login, wp, sites, query, mous communication networks. To understand the nature of abuse zwiebelfreunde, nforce websites receive through Tor, we look at the complaints exit opera- 5 Sent by Spamcop ip, received, content, , abuse, mail, tors received while running a Tor exit. We assume that every time a thank, message, manager, mso website receives abuse through Tor the administrator of the website 6 Sent by ValueHost valuehost, administrator, index, attempts, mail, complains to the exit operator. So these complaints can be a proxy noc, abuse, yor, disturb, shared for understanding the type and frequency of abuse happen through 7 Sent by Webiron abuse, issues, webiron, mail, clients, service, ip, a Tor exit. In this document, we study a partial set of complaints ticket, time, blacklist sent to [email protected]. Torservers.net runs several high bandwidth Tor exits. We only have access to the complaints sent to one of the exit operators who runs a dozen of Tor exits. Table 1: Result of KMeans clustering Our preliminary analysis of the complaint shows: 1. Over 99% (around 3 million) of the complaints are DMCA complaints. [email protected] 99% (2,971,227) of the complaints are DMCA related complaints. sent over 1 million of the complaints (Figure2). The remaining (approx) 13,000 emails have other complaints and responses from the side of Tor abuse servers. The corpus has 6,971 2. Over 99% of the DMCA complaints mention the use of Bit- non-DMCA complaint emails. Torrent and a few mention eDonkey (Figure1). 2.1 Automatically identify abuse. 3. The DMCA complaints mention 1200 unique IPs. Only 323 We extract the nature of abuse, the time of complaint and the of these IP addresses are related to the Tor network (accord- exit IP being complained about from each mail of the corpus. This ing to the consensuses published since 2010). WHOIS IP is straightforward in emails that are DMCA violations since these lookup shows that 74.3% of the 323 IP addresses listed con- emails have a format that can be easily parsed via regular expres- tact is [email protected]. sions. To extract the relevant abuse information from non-DMCA 4. The majority of the non-DMCA complaints are about brute complaint emails, we followed three steps: apply clustering, search force login attacks on WordPress (Figure3). for regular expressions, and search for relevant terms. 5. spam complaints almost stopped after August 2014 Apply clustering. (Figure5). We consider each email as a document and then vectorise each document based on Term Frequency-Inverse Document Frequency Note that compared to the total traffic of the exit relays, the num- (TF-IDF) values. We perform KMeans clustering on this set of ber of abuse complaints is negligible. The next step of the analysis vectorised documents to obtain 8 clusters (Table1). is to correlate the number of complaints with the exit relay band- Based on the TF-IDF terms, cluster 2 contains googlegroups width and policy. complaints, cluster 3 contains copyright infringement complaints, cluster 4 contains complaints filed by nforce, cluster 5 contains 2. THE COMPLAINT EMAILS. complaints filed by SpamCop, cluster 6 contains ValueHost com- The dataset consists of a partial set of complaints sent to abuse@ plaints, cluster 7 contains Webiron complaints. Clusters 0 and 1 torservers.net from June 2010 to April 2016. There are ap- seem to be a mixed bag of complaints. Thankfully none of the proximately 3 million complaint emails in this dataset. Of these, profanity showed up in top 10 terms. While we can make guesses about the type of abuse being com- 125000 plained about in the emails based on the cluster it belongs to, clus- tering of not perfect. In addition to this, clusters 0 and 1 seem to be mixed with emails of different types. In order to deal with the 100000 Complainer case of mis-clustered emails and emails in undefined clusters (clus- [email protected] activision@copyright−compliance.com ter 0 and 1), we further process the corpus for extracting the type [email protected]−notice.com broadgreenpictures@copyright−compliance.com of abuse. 75000 cengage@copyright−compliance.com graphicaudio_p2p@copyright−compliance.com [email protected] Search for regular expressions. [email protected]−notice.com notices.p2p@ip−echelon.com notices.warner@ip−echelon.com For all emails sent within a cluster, we enumerate regular ex- 50000 paramount@copyright−compliance.com pressions that capture the format of the email. For instance, if an Number of DMCA complaints sbstv@copyright−compliance.com starz_media@copyright−compliance.com email in the copyright complaint cluster is sent by Voxility, we use a teachingcompany@copyright−compliance.com zenimax_media@copyright−compliance.com regular expression to capture the type of abuse. The regular expres- 25000 sions are produced after manually observing the formats of emails received. Some senders have an email format while some sends free form text. To handle the free-form text, we use the last step. 0 For emails in undefined clusters, we look for server logs (HTTP

GET/POST request logs, SSH login fail logs etc) to infer the kind 2014 − Jul 2015 − Jul 2014 − Apr 2014 − Oct 2015 − Apr 2015 − Oct 2016 − Apr 2014 − Jan 2014 − Jun 2015 − Jan 2015 − Jun 2016 − Jan 2014 − Feb 2015 − Feb 2016 − Feb 2014 − Mar 2015 − Mar 2016 − Mar 2014 − Aug 2015 − Aug 2014 − Nov 2015 − Nov 2014 − Dec 2015 − Dec 2014 − Sep 2015 − Sep 2014 − May 2015 − May of abuse. Year and Month

Search for relevant terms. Figure 2: Number of complaints by each complainer, per week If the previous steps fail to find the kind of abuse, we look for (for DMCA complaints). terms like port scanning and email hacked to infer the kind of abuse that the email could be complaining about. Inferences made using this step are likely to be erroneous. We also look at the distribution of various types of non-copyright abuse over time (Figure4) (all WordPress plugin exploits combined 3. RESULTS. to one category). We noticed a spike in complaints of brute force WordPress logins at the end of 2015. Another interesting point is 3.1 Analysing DMCA complaints that complaints stopped after August 2014 (Figure5). We first look at the distribution of the DMCA complaints over time (Figure2). The 3 million DMCA complaints refer to 1200 Acknowledgment unique exit relay IPs. In addition, the complaints highlight the use We thank Moritz Bartl for sharing the abuse complaints with us. of BitTorrent or eDonkey for accessing content illegally (Figure1). Interestingly, most of these 1 million complaints are originated 4. REFERENCES from the same sender [email protected] (Figure [1] akamai’s [state of the internet] / security, Q2 2015 report. 2). [2] The Trouble with Tor. [3] S. Khattak, D. Fifield, S. Afroz, M. Javed, S. Sundaresan, V. Paxson, S. J. Murdoch, and D. McCoy. Do you see what i see?: Differential 15000 treatment of anonymous users. In Network and Distributed System Security Symposium 2016. IETF, 2016.

10000 Source/Protocol eDonkey Bittorrent

5000 Number of DMCA complaints per exit IP Number of DMCA complaints per exit 0 0 250 500 750 1000 1250 Exit Relay IPs

Figure 1: Number of DMCA complaints per exit relay IP and the protocol used for abuse by each.

3.2 Analysing non-DMCA complaints These complaints include complaints from individuals and com- panies about Spam, intrusions an various other kinds of attacks. Based on the 3-step approach, we identified the type of abuse. From the around 7000 original emails, we were not able to classify ap- proximately 800 emails. For these, the type of abuse is ‘other’ in our analysis. We classified these manually. The largest num- ber of complaints are about brute force login attacks on WordPress followed by the general category of malicious access of servers (in- cludes complaints of intrusion, and brute force HTTP POSTs) (Fig- ure3). 1200

800

400 Number of abuse complaints Number of abuse

0 Other Email Spam verbal threats verbal port scanning email hacking Known Botnet Known probing domain trademark violations Google Groups spam bruteforce login attempt bruteforce abusive posts on usenet abusive use of stolen credit cards URL Redirection Link Spam Vulnerability Target Scan Bot Target Vulnerability WordPress Login Brute Force WordPress general spam (comment or email) general Semalt Referrer Spam Tor Exit Bot Spam Tor Semalt Referrer blocking all tor secnap emerging threats blocking WordPress Login Brute Force (XMLRPC) Login Brute Force WordPress WordPress Login Page Scan Bot(Tor Exit) Scan Bot(Tor Login Page WordPress WordPress Vulnerability Scanner Bot(Tor Exit) Scanner Bot(Tor Vulnerability WordPress Host banned for attempting to upload malware. Host banned for WordPress "wpshop" Plugin Uploader Exploiter WordPress WordPress gf_page Upload Exploit Bot (Tor Exit) gf_page Upload Exploit Bot (Tor WordPress WordPress "reflex−gallery" Plugin Update Exploiter WordPress WordPress "dzs−zoomsounds" Plugin Update Exploiter "dzs−zoomsounds" WordPress malicious access of server (intrusion/bruteforce access) WordPress Config Dump Vulnerability Exploiter(Tor Exit) Exploiter(Tor Config Dump Vulnerability WordPress WordPress "sexy−contact−form" Plugin Upload Exploiter "sexy−contact−form" WordPress WordPress "wpDataTable" side−load exploit Bot (Tor exit) Bot (Tor side−load exploit "wpDataTable" WordPress Host was detected using post spam. AKA Comment spam Host was WordPress "revslider_ajax_action" Plugin Update Exploiter "revslider_ajax_action" WordPress WordPress "pagelines_test_ajax" Settings Exploiter(Tor Exit) "pagelines_test_ajax" Settings Exploiter(Tor WordPress WordPress "simple−ads−manager" Plugin Uploader Exploiter WordPress WordPress "com_miwoftp wp−config.php download" Exploit Bot wp−config.php download" "com_miwoftp WordPress WordPress (nm_webcontact/jquery−file−Upload) Plugin Uploader Exploiter WordPress WordPress AJAX widgets_init/UPCP_AddProductSpreadsheet Upload Exploiter AJAX WordPress Type of abuse

Figure 3: Types of non-copyright abuse complaints and their popularity 300

200

100 Number of non−DMCA abuse complaints Number of non−DMCA abuse

0 2010−Jul 2011−Jul 2012−Jul 2012−Jul 2013−Jul 2014−Jul 2015−Jul 2010−Oct 2011−Apr 2011−Oct 2012−Apr 2012−Oct 2013−Apr 2013−Oct 2014−Apr 2014−Oct 2015−Apr 2015−Oct 2016−Apr 2010−Jun 2011−Jan 2011−Jun 2012−Jan 2012−Jun 2013−Jan 2013−Jun 2014−Jan 2014−Jun 2015−Jan 2015−Jun 2016−Jan 2011−Feb 2012−Feb 2013−Feb 2014−Feb 2015−Feb 2016−Feb 2011−Mar 2012−Mar 2013−Mar 2014−Mar 2015−Mar 2016−Mar 2010−Aug 2011−Aug 2012−Aug 2013−Aug 2014−Aug 2015−Aug 2010−Nov 2011−Nov 2012−Nov 2013−Nov 2014−Nov 2015−Nov 2010−Dec 2011−Dec 2012−Dec 2013−Dec 2014−Dec 2015−Dec 2010−Sep 2011−Sep 2012−Sep 2013−Sep 2014−Sep 2015−Sep 2011−May 2012−May 2013−May 2014−May 2015−May Year and Month

Email Spam URL Redirection Link Spam WordPress Login Brute Force blocking all tor secnap emerging threats trademark violations

Google Groups spam Vulnerability Target Scan Bot WordPress Login Brute Force (XMLRPC) bruteforce login attempt use of stolen credit cards

Host banned for attempting to upload malware. WordPress "com_miwoftp wp−config.php download" Exploit Bot WordPress Login Page Scan Bot(Tor Exit) email hacking verbal threats

Abuse Type Host was detected using post spam. AKA Comment spam WordPress "pagelines_test_ajax" Settings Exploiter(Tor Exit) WordPress Plugin Exploit general spam (comment or email)

Known Botnet WordPress "wpDataTable" side−load exploit Bot (Tor exit) WordPress Vulnerability Scanner Bot(Tor Exit) malicious access of server (intrusion/bruteforce access)

Other WordPress AJAX widgets_init/UPCP_AddProductSpreadsheet Upload Exploiter WordPress gf_page Upload Exploit Bot (Tor Exit) port scanning

Semalt Referrer Spam Tor Exit Bot WordPress Config Dump Vulnerability Exploiter(Tor Exit) abusive posts on usenet probing domain

Figure 4: Distribution of different types of non-copyright complaints over time.

60

40

20 Number of Email Spam complaints 0 2010−Jul 2011−Jul 2012−Jul 2013−Jul 2014−Jul 2015−Jul 2010−Apr 2010−Oct 2011−Apr 2011−Oct 2012−Apr 2012−Oct 2013−Apr 2013−Oct 2014−Apr 2014−Oct 2015−Apr 2015−Oct 2016−Apr 2010−Jan 2010−Jan 2010−Jun 2011−Jan 2011−Jun 2012−Jan 2012−Jun 2013−Jan 2013−Jun 2014−Jan 2014−Jun 2015−Jan 2015−Jun 2016−Jan 2011−Feb 2012−Feb 2013−Feb 2014−Feb 2015−Feb 2016−Feb 2010−Mar 2011−Mar 2012−Mar 2013−Mar 2014−Mar 2015−Mar 2016−Mar 2010−Aug 2011−Aug 2012−Aug 2013−Aug 2014−Aug 2015−Aug 2010−Nov 2011−Nov 2012−Nov 2013−Nov 2014−Nov 2015−Nov 2010−Dec 2011−Dec 2012−Dec 2013−Dec 2014−Dec 2015−Dec 2015−Dec 2010−Sep 2011−Sep 2012−Sep 2013−Sep 2014−Sep 2015−Sep 2010−May 2010−May 2011−May 2012−May 2013−May 2014−May 2015−May 2016−May Year and Month

Figure 5: The end of email spam complaints from Tor